[alsa-devel] ALSA calling pcm_pointer excessively?
I'm wondering what is causing ALSA to behave as follows - I have a 64KiB buffer, which is divided up into 4 16KiB periods. The log below is from my soc-dmaengine.c code which supports audio on the Assabet (as we don't have cyclic DMA support there yet.)
I'm just trying to sort out the DMA engine tx_status residue code, and am using this as proof that its working.
However, it seems that ALSA for whatever reason likes to hammer away at the pcm_pointer callback from time to time:
soc_dmaengine_pointer: index 3 stat 1 residue 0018 cookie 3cfe pos 10000 bytes ffe8 soc_dmaengine_pointer: index 3 stat 1 residue 0008 cookie 3cfe pos 10000 bytes fff8 soc_dmaengine_callback: complete soc_dmaengine_pointer: index 0 stat 1 residue 3ff0 cookie 3cff pos 4000 bytes 0010 soc_dmaengine_submit: submitting index 3 size 4000 cookie 3d02 soc_dmaengine_pointer: index 0 stat 1 residue 3fd0 cookie 3cff pos 4000 bytes 0030 soc_dmaengine_pointer: index 0 stat 1 residue 17b8 cookie 3cff pos 4000 bytes 2848 soc_dmaengine_pointer: index 0 stat 1 residue 1770 cookie 3cff pos 4000 bytes 2890 soc_dmaengine_pointer: index 0 stat 1 residue 1768 cookie 3cff pos 4000 bytes 2898 soc_dmaengine_pointer: index 0 stat 1 residue 1758 cookie 3cff pos 4000 bytes 28a8 soc_dmaengine_pointer: index 0 stat 1 residue 1748 cookie 3cff pos 4000 bytes 28b8 soc_dmaengine_pointer: index 0 stat 1 residue 1740 cookie 3cff pos 4000 bytes 28c0 soc_dmaengine_pointer: index 0 stat 1 residue 1730 cookie 3cff pos 4000 bytes 28d0 soc_dmaengine_pointer: index 0 stat 1 residue 1720 cookie 3cff pos 4000 bytes 28e0 soc_dmaengine_pointer: index 0 stat 1 residue 1718 cookie 3cff pos 4000 bytes 28e8 ... soc_dmaengine_pointer: index 0 stat 1 residue 0018 cookie 3cff pos 4000 bytes 3fe8 soc_dmaengine_pointer: index 0 stat 1 residue 0008 cookie 3cff pos 4000 bytes 3ff8 soc_dmaengine_callback: complete soc_dmaengine_pointer: index 1 stat 1 residue 3ff0 cookie 3d00 pos 8000 bytes 4010 soc_dmaengine_submit: submitting index 0 size 4000 cookie 3d03 soc_dmaengine_pointer: index 1 stat 1 residue 3fd0 cookie 3d00 pos 8000 bytes 4030 soc_dmaengine_pointer: index 1 stat 1 residue 3b28 cookie 3d00 pos 8000 bytes 44d8 soc_dmaengine_pointer: index 1 stat 1 residue 3b10 cookie 3d00 pos 8000 bytes 44f0 soc_dmaengine_pointer: index 1 stat 1 residue 3b00 cookie 3d00 pos 8000 bytes 4500
'pos' is the end of the current DMA buffer ((index + 1) * period_byte_size) 'residue' is the value returned from tx_status 'bytes' is the value passed into bytes_to_frames() to return from my pcm_pointer function.
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
Any ideas?
On Thu, May 10, 2012 at 05:06:01PM +0100, Russell King - ARM Linux wrote:
I'm wondering what is causing ALSA to behave as follows - I have a 64KiB buffer, which is divided up into 4 16KiB periods. The log below is from my soc-dmaengine.c code which supports audio on the Assabet (as we don't have cyclic DMA support there yet.)
I'm just trying to sort out the DMA engine tx_status residue code, and am using this as proof that its working.
However, it seems that ALSA for whatever reason likes to hammer away at the pcm_pointer callback from time to time:
soc_dmaengine_pointer: index 3 stat 1 residue 0018 cookie 3cfe pos 10000 bytes ffe8 soc_dmaengine_pointer: index 3 stat 1 residue 0008 cookie 3cfe pos 10000 bytes fff8 soc_dmaengine_callback: complete soc_dmaengine_pointer: index 0 stat 1 residue 3ff0 cookie 3cff pos 4000 bytes 0010 soc_dmaengine_submit: submitting index 3 size 4000 cookie 3d02 soc_dmaengine_pointer: index 0 stat 1 residue 3fd0 cookie 3cff pos 4000 bytes 0030 soc_dmaengine_pointer: index 0 stat 1 residue 17b8 cookie 3cff pos 4000 bytes 2848 soc_dmaengine_pointer: index 0 stat 1 residue 1770 cookie 3cff pos 4000 bytes 2890 soc_dmaengine_pointer: index 0 stat 1 residue 1768 cookie 3cff pos 4000 bytes 2898 soc_dmaengine_pointer: index 0 stat 1 residue 1758 cookie 3cff pos 4000 bytes 28a8 soc_dmaengine_pointer: index 0 stat 1 residue 1748 cookie 3cff pos 4000 bytes 28b8 soc_dmaengine_pointer: index 0 stat 1 residue 1740 cookie 3cff pos 4000 bytes 28c0 soc_dmaengine_pointer: index 0 stat 1 residue 1730 cookie 3cff pos 4000 bytes 28d0 soc_dmaengine_pointer: index 0 stat 1 residue 1720 cookie 3cff pos 4000 bytes 28e0 soc_dmaengine_pointer: index 0 stat 1 residue 1718 cookie 3cff pos 4000 bytes 28e8 ... soc_dmaengine_pointer: index 0 stat 1 residue 0018 cookie 3cff pos 4000 bytes 3fe8 soc_dmaengine_pointer: index 0 stat 1 residue 0008 cookie 3cff pos 4000 bytes 3ff8 soc_dmaengine_callback: complete soc_dmaengine_pointer: index 1 stat 1 residue 3ff0 cookie 3d00 pos 8000 bytes 4010 soc_dmaengine_submit: submitting index 0 size 4000 cookie 3d03 soc_dmaengine_pointer: index 1 stat 1 residue 3fd0 cookie 3d00 pos 8000 bytes 4030 soc_dmaengine_pointer: index 1 stat 1 residue 3b28 cookie 3d00 pos 8000 bytes 44d8 soc_dmaengine_pointer: index 1 stat 1 residue 3b10 cookie 3d00 pos 8000 bytes 44f0 soc_dmaengine_pointer: index 1 stat 1 residue 3b00 cookie 3d00 pos 8000 bytes 4500
'pos' is the end of the current DMA buffer ((index + 1) * period_byte_size) 'residue' is the value returned from tx_status 'bytes' is the value passed into bytes_to_frames() to return from my pcm_pointer function.
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
Any ideas?
bash-2.05a# cat /proc/asound/card0/pcm0p/sub0/sw_params tstamp_mode: NONE period_step: 1 avail_min: 4096 start_threshold: 16384 stop_threshold: 16384 silence_threshold: 0 silence_size: 0 boundary: 1073741824
I think what's happening is that snd_pcm_lib_write1() is looping, and each time it updates the hardware position, it finds that it can transfer 8 or 16 bytes to the buffer. Once it's transferred that, it re-updates the hardware position which has now advanced by another 8 or 16 bytes. Repeat, and you find that snd_pcm_lib_write1() spends a lot of time inefficiently copying the buffer.
It seems that it will only sleep if the hardware pointer stops making progress. That implies that I have to purposely reduce the granularity that I report the current DMA position which isn't a sane thing to do.
Russell King - ARM Linux wrote:
I think what's happening is that snd_pcm_lib_write1() is looping, and each time it updates the hardware position, it finds that it can transfer 8 or 16 bytes to the buffer. Once it's transferred that, it re-updates the hardware position which has now advanced by another 8 or 16 bytes. Repeat, and you find that snd_pcm_lib_write1() spends a lot of time inefficiently copying the buffer.
It seems that it will only sleep if the hardware pointer stops making progress.
I'd guess that most existing hardware is fast enough to transfer samples and has a big enough granularity (typically 32 byte bursts for PCI) that the pointer doesn't change in consecutive loop iterations.
This (untested) patch tries to avoid too many busy looping.
Regards, Clemens
--- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1894,7 +1894,7 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, struct snd_pcm_runtime *runtime = substream->runtime; snd_pcm_uframes_t xfer = 0; snd_pcm_uframes_t offset = 0; - int err = 0; + int busy_loops = 0, err = 0;
if (size == 0) return 0; @@ -1919,12 +1919,17 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, runtime->twake = runtime->control->avail_min ? : 1; while (size > 0) { snd_pcm_uframes_t frames, appl_ptr, appl_ofs; - snd_pcm_uframes_t avail; + snd_pcm_uframes_t avail, avail_wait_max; snd_pcm_uframes_t cont; + + if (busy_loops < 5) + avail_wait_max = 0; + else + avail_wait_max = min(runtime->control->avail_min, size); if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) snd_pcm_update_hw_ptr(substream); avail = snd_pcm_playback_avail(runtime); - if (!avail) { + if (avail <= avail_wait_max) { if (nonblock) { err = -EAGAIN; goto _end_unlock; @@ -1934,6 +1939,9 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, err = wait_for_avail(substream, &avail); if (err < 0) goto _end_unlock; + busy_loops = 0; + } else { + busy_loops++; } frames = size > avail ? avail : size; cont = runtime->buffer_size - runtime->control->appl_ptr % runtime->buffer_size;
I wrote:
This (untested) patch tries to avoid too many busy looping.
Oops, off-by-one error (avail == avail_min should not wait).
Regards, Clemens
--- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1894,7 +1894,7 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, struct snd_pcm_runtime *runtime = substream->runtime; snd_pcm_uframes_t xfer = 0; snd_pcm_uframes_t offset = 0; - int err = 0; + int busy_loops = 0, err = 0;
if (size == 0) return 0; @@ -1919,12 +1919,17 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, runtime->twake = runtime->control->avail_min ? : 1; while (size > 0) { snd_pcm_uframes_t frames, appl_ptr, appl_ofs; - snd_pcm_uframes_t avail; + snd_pcm_uframes_t avail, avail_nowait_min; snd_pcm_uframes_t cont; + + if (busy_loops < 5) + avail_nowait_min = 1; + else + avail_nowait_min = min(runtime->control->avail_min, size); if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) snd_pcm_update_hw_ptr(substream); avail = snd_pcm_playback_avail(runtime); - if (!avail) { + if (avail < avail_nowait_min) { if (nonblock) { err = -EAGAIN; goto _end_unlock; @@ -1934,6 +1939,9 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, err = wait_for_avail(substream, &avail); if (err < 0) goto _end_unlock; + busy_loops = 0; + } else { + busy_loops++; } frames = size > avail ? avail : size; cont = runtime->buffer_size - runtime->control->appl_ptr % runtime->buffer_size;
On Thu, May 10, 2012 at 07:13:59PM +0200, Clemens Ladisch wrote:
I wrote:
This (untested) patch tries to avoid too many busy looping.
Oops, off-by-one error (avail == avail_min should not wait).
This is better - it now sits around 2-14% system time. I guess the five loops is rather arbitary?
Russell King - ARM Linux wrote:
On Thu, May 10, 2012 at 07:13:59PM +0200, Clemens Ladisch wrote:
I wrote:
This (untested) patch tries to avoid too many busy looping.
Oops, off-by-one error (avail == avail_min should not wait).
This is better - it now sits around 2-14% system time. I guess the five loops is rather arbitary?
Yes. The minimum loop limit is two (to handle wraparound at the end of the buffer); more than that probably does not actually make sense.
Jassi Brar wrote:
If we really want to change the behavior of play, wouldn't we want to do that for capture(snd_pcm_lib_read1) as well ?
Indeed. And we'd want to reuse the twake calculation.
Regards, Clemens
On Thu, May 10, 2012 at 08:43:40PM +0200, Clemens Ladisch wrote:
Russell King - ARM Linux wrote:
On Thu, May 10, 2012 at 07:13:59PM +0200, Clemens Ladisch wrote:
I wrote:
This (untested) patch tries to avoid too many busy looping.
Oops, off-by-one error (avail == avail_min should not wait).
This is better - it now sits around 2-14% system time. I guess the five loops is rather arbitary?
Yes. The minimum loop limit is two (to handle wraparound at the end of the buffer); more than that probably does not actually make sense.
Ok, I'll test with two there tomorrow.
On Thu, May 10, 2012 at 10:35 PM, Clemens Ladisch clemens@ladisch.de wrote:
--- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1894,7 +1894,7 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, struct snd_pcm_runtime *runtime = substream->runtime; snd_pcm_uframes_t xfer = 0; snd_pcm_uframes_t offset = 0;
- int err = 0;
- int busy_loops = 0, err = 0;
if (size == 0) return 0; @@ -1919,12 +1919,17 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, runtime->twake = runtime->control->avail_min ? : 1; while (size > 0) { snd_pcm_uframes_t frames, appl_ptr, appl_ofs;
- snd_pcm_uframes_t avail;
- snd_pcm_uframes_t avail, avail_wait_max;
snd_pcm_uframes_t cont;
- if (busy_loops < 5)
- avail_wait_max = 0;
- else
- avail_wait_max = min(runtime->control->avail_min, size);
if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) snd_pcm_update_hw_ptr(substream); avail = snd_pcm_playback_avail(runtime);
- if (!avail) {
- if (avail <= avail_wait_max) {
if (nonblock) { err = -EAGAIN; goto _end_unlock; @@ -1934,6 +1939,9 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, err = wait_for_avail(substream, &avail); if (err < 0) goto _end_unlock;
- busy_loops = 0;
- } else {
- busy_loops++;
} frames = size > avail ? avail : size; cont = runtime->buffer_size - runtime->control->appl_ptr % runtime->buffer_size;
Hi Clemens,
If we really want to change the behavior of play, wouldn't we want to do that for capture(snd_pcm_lib_read1) as well ?
At Thu, 10 May 2012 19:05:56 +0200, Clemens Ladisch wrote:
Russell King - ARM Linux wrote:
I think what's happening is that snd_pcm_lib_write1() is looping, and each time it updates the hardware position, it finds that it can transfer 8 or 16 bytes to the buffer. Once it's transferred that, it re-updates the hardware position which has now advanced by another 8 or 16 bytes. Repeat, and you find that snd_pcm_lib_write1() spends a lot of time inefficiently copying the buffer.
It seems that it will only sleep if the hardware pointer stops making progress.
I'd guess that most existing hardware is fast enough to transfer samples and has a big enough granularity (typically 32 byte bursts for PCI) that the pointer doesn't change in consecutive loop iterations.
This (untested) patch tries to avoid too many busy looping.
Hmm... I still can't follow why such a busy loop happens when avail_min > 1. If avail_min = 1, a busy loop can't be avoided. But if avail_min is set (typically equal with period_size), runtime->twake is either the rest size or the period size, and wait_for_avail() should wait until that sufficiently.
Russell, which avail_min value is used in your test case? (I thought aplay uses avail_min = period_size as default, but just to be sure.)
thanks,
Takashi
Regards, Clemens
--- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1894,7 +1894,7 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, struct snd_pcm_runtime *runtime = substream->runtime; snd_pcm_uframes_t xfer = 0; snd_pcm_uframes_t offset = 0;
- int err = 0;
int busy_loops = 0, err = 0;
if (size == 0) return 0;
@@ -1919,12 +1919,17 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, runtime->twake = runtime->control->avail_min ? : 1; while (size > 0) { snd_pcm_uframes_t frames, appl_ptr, appl_ofs;
snd_pcm_uframes_t avail;
snd_pcm_uframes_t cont;snd_pcm_uframes_t avail, avail_wait_max;
if (busy_loops < 5)
avail_wait_max = 0;
else
if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) snd_pcm_update_hw_ptr(substream); avail = snd_pcm_playback_avail(runtime);avail_wait_max = min(runtime->control->avail_min, size);
if (!avail) {
if (avail <= avail_wait_max) { if (nonblock) { err = -EAGAIN; goto _end_unlock;
@@ -1934,6 +1939,9 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, err = wait_for_avail(substream, &avail); if (err < 0) goto _end_unlock;
busy_loops = 0;
} else {
} frames = size > avail ? avail : size; cont = runtime->buffer_size - runtime->control->appl_ptr % runtime->buffer_size;busy_loops++;
Alsa-devel mailing list Alsa-devel@alsa-project.org http://mailman.alsa-project.org/mailman/listinfo/alsa-devel
On Fri, May 11, 2012 at 03:31:08PM +0200, Takashi Iwai wrote:
At Thu, 10 May 2012 19:05:56 +0200, Clemens Ladisch wrote:
Russell King - ARM Linux wrote:
I think what's happening is that snd_pcm_lib_write1() is looping, and each time it updates the hardware position, it finds that it can transfer 8 or 16 bytes to the buffer. Once it's transferred that, it re-updates the hardware position which has now advanced by another 8 or 16 bytes. Repeat, and you find that snd_pcm_lib_write1() spends a lot of time inefficiently copying the buffer.
It seems that it will only sleep if the hardware pointer stops making progress.
I'd guess that most existing hardware is fast enough to transfer samples and has a big enough granularity (typically 32 byte bursts for PCI) that the pointer doesn't change in consecutive loop iterations.
This (untested) patch tries to avoid too many busy looping.
Hmm... I still can't follow why such a busy loop happens when avail_min > 1. If avail_min = 1, a busy loop can't be avoided. But if avail_min is set (typically equal with period_size), runtime->twake is either the rest size or the period size, and wait_for_avail() should wait until that sufficiently.
It's exaclty as I explained. Lets take avail_min = 4096, as it is in my case.
In snd_pcm_lib_write1(), we loop - here's the simplified version:
while (size > 0) { if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) snd_pcm_update_hw_ptr(substream); avail = snd_pcm_playback_avail(runtime); if (!avail) { /* Sleep until avail >= avail_min */ } /* Transfer up to max(avail,size) bytes to buffer */ }
The problem is this. On entry to the loop, say, we calculate that we can transfer 4096 bytes into the buffer, and we do that. While we're transferring those bytes, the hardware DMA position has advanced.
So, the second time around the loop, avail will be non-zero - let's say that the DMA position has advanced 32 bytes. So avail will be 32 bytes, and we transfer 32 bytes. Again, while transferring those bytes, the DMA position has advanced, but not as far. Let's say 8 bytes.
So, we repeat the loop, and again, we find avail is non-zero. We omit sleeping, instead, dropping through to transfer 8 bytes.
Meanwhile, the DMA position has advanced another 8 bytes.
Repeat, endlessly, wasting lots of CPU cycles transferring very small sets of sample data to the buffer.
That happens because of the requirement for the above code to sleep is that there has been _no_ advancement of the DMA position while copying data to the buffer. Unfortunately, that's not always a realistic expectation, especially if you have a high data rate.
At no point in the above scenario does avail_min come into it, until we're lucky enough that we copy data into the buffer without the DMA position changing. At that point, we're then allowed to sleep. The decision to sleep is based purely upon there being _zero_ bytes of available buffer space.
At Fri, 11 May 2012 15:02:37 +0100, Russell King - ARM Linux wrote:
On Fri, May 11, 2012 at 03:31:08PM +0200, Takashi Iwai wrote:
At Thu, 10 May 2012 19:05:56 +0200, Clemens Ladisch wrote:
Russell King - ARM Linux wrote:
I think what's happening is that snd_pcm_lib_write1() is looping, and each time it updates the hardware position, it finds that it can transfer 8 or 16 bytes to the buffer. Once it's transferred that, it re-updates the hardware position which has now advanced by another 8 or 16 bytes. Repeat, and you find that snd_pcm_lib_write1() spends a lot of time inefficiently copying the buffer.
It seems that it will only sleep if the hardware pointer stops making progress.
I'd guess that most existing hardware is fast enough to transfer samples and has a big enough granularity (typically 32 byte bursts for PCI) that the pointer doesn't change in consecutive loop iterations.
This (untested) patch tries to avoid too many busy looping.
Hmm... I still can't follow why such a busy loop happens when avail_min > 1. If avail_min = 1, a busy loop can't be avoided. But if avail_min is set (typically equal with period_size), runtime->twake is either the rest size or the period size, and wait_for_avail() should wait until that sufficiently.
It's exaclty as I explained. Lets take avail_min = 4096, as it is in my case.
In snd_pcm_lib_write1(), we loop - here's the simplified version:
while (size > 0) { if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) snd_pcm_update_hw_ptr(substream); avail = snd_pcm_playback_avail(runtime); if (!avail) { /* Sleep until avail >= avail_min */ } /* Transfer up to max(avail,size) bytes to buffer */
}
The problem is this. On entry to the loop, say, we calculate that we can transfer 4096 bytes into the buffer, and we do that. While we're transferring those bytes, the hardware DMA position has advanced.
So, the second time around the loop, avail will be non-zero - let's say that the DMA position has advanced 32 bytes. So avail will be 32 bytes, and we transfer 32 bytes. Again, while transferring those bytes, the DMA position has advanced, but not as far. Let's say 8 bytes.
So, we repeat the loop, and again, we find avail is non-zero. We omit sleeping, instead, dropping through to transfer 8 bytes.
Meanwhile, the DMA position has advanced another 8 bytes.
Repeat, endlessly, wasting lots of CPU cycles transferring very small sets of sample data to the buffer.
That happens because of the requirement for the above code to sleep is that there has been _no_ advancement of the DMA position while copying data to the buffer. Unfortunately, that's not always a realistic expectation, especially if you have a high data rate.
At no point in the above scenario does avail_min come into it, until we're lucky enough that we copy data into the buffer without the DMA position changing. At that point, we're then allowed to sleep. The decision to sleep is based purely upon there being _zero_ bytes of available buffer space.
Ah, thanks for clearing, now I got it.
Actually the call of snd_pcm_update_hw_ptr() at the loop entry doesn't have to be done at each time, supposing the process of the loop (without wait_for_avail()) is fast enough.
How about the patch below?
Takashi
--- diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c index 4d18941..faedb14 100644 --- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1894,6 +1894,7 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, struct snd_pcm_runtime *runtime = substream->runtime; snd_pcm_uframes_t xfer = 0; snd_pcm_uframes_t offset = 0; + snd_pcm_uframes_t avail; int err = 0;
if (size == 0) @@ -1917,13 +1918,12 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, }
runtime->twake = runtime->control->avail_min ? : 1; + if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) + snd_pcm_update_hw_ptr(substream); + avail = snd_pcm_playback_avail(runtime); while (size > 0) { snd_pcm_uframes_t frames, appl_ptr, appl_ofs; - snd_pcm_uframes_t avail; snd_pcm_uframes_t cont; - if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) - snd_pcm_update_hw_ptr(substream); - avail = snd_pcm_playback_avail(runtime); if (!avail) { if (nonblock) { err = -EAGAIN; @@ -1971,6 +1971,7 @@ static snd_pcm_sframes_t snd_pcm_lib_write1(struct snd_pcm_substream *substream, offset += frames; size -= frames; xfer += frames; + avail -= frames; if (runtime->status->state == SNDRV_PCM_STATE_PREPARED && snd_pcm_playback_hw_avail(runtime) >= (snd_pcm_sframes_t)runtime->start_threshold) { err = snd_pcm_start(substream); @@ -2111,6 +2112,7 @@ static snd_pcm_sframes_t snd_pcm_lib_read1(struct snd_pcm_substream *substream, struct snd_pcm_runtime *runtime = substream->runtime; snd_pcm_uframes_t xfer = 0; snd_pcm_uframes_t offset = 0; + snd_pcm_uframes_t avail; int err = 0;
if (size == 0) @@ -2141,13 +2143,12 @@ static snd_pcm_sframes_t snd_pcm_lib_read1(struct snd_pcm_substream *substream, }
runtime->twake = runtime->control->avail_min ? : 1; + if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) + snd_pcm_update_hw_ptr(substream); + avail = snd_pcm_capture_avail(runtime); while (size > 0) { snd_pcm_uframes_t frames, appl_ptr, appl_ofs; - snd_pcm_uframes_t avail; snd_pcm_uframes_t cont; - if (runtime->status->state == SNDRV_PCM_STATE_RUNNING) - snd_pcm_update_hw_ptr(substream); - avail = snd_pcm_capture_avail(runtime); if (!avail) { if (runtime->status->state == SNDRV_PCM_STATE_DRAINING) { @@ -2202,6 +2203,7 @@ static snd_pcm_sframes_t snd_pcm_lib_read1(struct snd_pcm_substream *substream, offset += frames; size -= frames; xfer += frames; + avail -= frames; } _end_unlock: runtime->twake = 0;
On Fri, May 11, 2012 at 04:17:50PM +0200, Takashi Iwai wrote:
Actually the call of snd_pcm_update_hw_ptr() at the loop entry doesn't have to be done at each time, supposing the process of the loop (without wait_for_avail()) is fast enough.
How about the patch below?
This too seems to work fine on playback. Thanks.
At Fri, 11 May 2012 16:04:50 +0100, Russell King - ARM Linux wrote:
On Fri, May 11, 2012 at 04:17:50PM +0200, Takashi Iwai wrote:
Actually the call of snd_pcm_update_hw_ptr() at the loop entry doesn't have to be done at each time, supposing the process of the loop (without wait_for_avail()) is fast enough.
How about the patch below?
This too seems to work fine on playback. Thanks.
Good to hear. I queued it for linux-next now.
thanks,
Takashi
On Fri, May 11, 2012 at 06:31:53PM +0200, Takashi Iwai wrote:
At Fri, 11 May 2012 16:04:50 +0100, Russell King - ARM Linux wrote:
On Fri, May 11, 2012 at 04:17:50PM +0200, Takashi Iwai wrote:
Actually the call of snd_pcm_update_hw_ptr() at the loop entry doesn't have to be done at each time, supposing the process of the loop (without wait_for_avail()) is fast enough.
How about the patch below?
This too seems to work fine on playback. Thanks.
Good to hear. I queued it for linux-next now.
Please ensure that it gets:
Tested-by: Russell King rmk+kernel@arm.linux.org.uk
and not with this address. Thanks.
At Fri, 11 May 2012 17:38:44 +0100, Russell King - ARM Linux wrote:
On Fri, May 11, 2012 at 06:31:53PM +0200, Takashi Iwai wrote:
At Fri, 11 May 2012 16:04:50 +0100, Russell King - ARM Linux wrote:
On Fri, May 11, 2012 at 04:17:50PM +0200, Takashi Iwai wrote:
Actually the call of snd_pcm_update_hw_ptr() at the loop entry doesn't have to be done at each time, supposing the process of the loop (without wait_for_avail()) is fast enough.
How about the patch below?
This too seems to work fine on playback. Thanks.
Good to hear. I queued it for linux-next now.
Please ensure that it gets:
Tested-by: Russell King rmk+kernel@arm.linux.org.uk
and not with this address. Thanks.
I didn't add your tested-by tag since it wasn't explicitly given, but now added with the right address. Thanks.
Takashi
On Thu, May 10, 2012 at 12:06 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
I've had this happen, but not with aplay. I sent something to the list a few months ago about it.
The problem is triggered when writing to ALSA with writes that are unequal to the period size and using sampling rate conversion in an ALSA plugin. Code in alsa-lib will get into a loop calling poll() on the pcm fd, and ALSA's poll() calls the pointer method to figure out if there is enough buffer space free to return or if it poll() should block.
That is all fine, but rather than write more data when poll() returns, alsa-lib decides to call poll() again. So it's stuck in a tight loop calling poll() which isn't blocking. Eventually there will be enough space in the buffer that alsa will write some data and poll() will block.
Basically poll() will return when there is at least X bytes available in the buffer. X = avail_min. ALSA will only write more data when there is at least Y bytes free in the buffer. The bug is X < Y. This only happens when sample rate conversion and only if the app doesn't call snd_pcm_writei() with full periods. While the available space in the buffer is > X and < Y, ALSA is in a tight loop calling poll() which calls the PCM's pointer method. If X == Y, then this would never happen.
On Thu, May 10, 2012 at 12:34:19PM -0400, Trent Piepho wrote:
On Thu, May 10, 2012 at 12:06 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
I've had this happen, but not with aplay. I sent something to the list a few months ago about it.
The problem is triggered when writing to ALSA with writes that are unequal to the period size and using sampling rate conversion in an ALSA plugin.
I'm not using any plugins, and there's definitely no sample rate conversion going on. I'm basically piping a raw S16_LE stereo file into "aplay -c2 -r48000 -fS16_LE" and that's the only format the hardware supports:
Playing raw data 'stdin' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo Plug PCM: Hardware PCM card 0 'Assabet' device 0 subdevice 0
Its setup is: stream : PLAYBACK access : RW_INTERLEAVED format : S16_LE subformat : STD channels : 2 rate : 48000 exact rate : 48000 (48000/1) msbits : 16 buffer_size : 16384 period_size : 4096 period_time : 85333 tick_time : 0 tstamp_mode : NONE period_step : 1 sleep_min : 0 avail_min : 4096 xfer_align : 4096 start_threshold : 16384 stop_threshold : 16384 silence_threshold: 0 silence_size : 0 boundary : 1073741824
Note that this is not using mmap, as ARM is not cache coherent.
Interestingly, if I use -M to aplay to force it to use mmap, things get a _lot_ better and I only get a lot fewer calls to the pcm_pointer() method per period. I've included in this log the DMA engine output (which shows the raw DMA address - the buffer is 64K aligned) and where the pcm_pointer() method was called from.
tx_status: addr:c0f14010 soc_dmaengine_pointer: index 1 stat 1 residue 3ff0 cookie 1792 pos 8000 bytes 4010 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_submit: submitting index 0 size 4000 cookie 1795 tx_status: addr:c0f14060 soc_dmaengine_pointer: index 1 stat 1 residue 3fa0 cookie 1792 pos 8000 bytes 4060 snd_pcm_update_hw_ptr0+0x30/0x24c tx_status: addr:c0f14130 soc_dmaengine_pointer: index 1 stat 1 residue 3ed0 cookie 1792 pos 8000 bytes 4130 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_callback: complete tx_status: addr:c0f18010 soc_dmaengine_pointer: index 2 stat 1 residue 3ff0 cookie 1793 pos c000 bytes 8010 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_submit: submitting index 1 size 4000 cookie 1796 tx_status: addr:c0f18060 soc_dmaengine_pointer: index 2 stat 1 residue 3fa0 cookie 1793 pos c000 bytes 8060 snd_pcm_update_hw_ptr0+0x30/0x24c tx_status: addr:c0f18150 soc_dmaengine_pointer: index 2 stat 1 residue 3eb0 cookie 1793 pos c000 bytes 8150 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_callback: complete tx_status: addr:c0f1c010 soc_dmaengine_pointer: index 3 stat 1 residue 3ff0 cookie 1794 pos 10000 bytes c010 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_submit: submitting index 2 size 4000 cookie 1797 tx_status: addr:c0f1c060 soc_dmaengine_pointer: index 3 stat 1 residue 3fa0 cookie 1794 pos 10000 bytes c060 snd_pcm_update_hw_ptr0+0x30/0x24c tx_status: addr:c0f1c130 soc_dmaengine_pointer: index 3 stat 1 residue 3ed0 cookie 1794 pos 10000 bytes c130 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_callback: complete tx_status: addr:c0f10068 soc_dmaengine_pointer: index 0 stat 1 residue 3f98 cookie 1795 pos 4000 bytes 0068 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_submit: submitting index 3 size 4000 cookie 1798 tx_status: addr:c0f100f0 soc_dmaengine_pointer: index 0 stat 1 residue 3f10 cookie 1795 pos 4000 bytes 00f0 snd_pcm_update_hw_ptr0+0x30/0x24c tx_status: addr:c0f101c8 soc_dmaengine_pointer: index 0 stat 1 residue 3e38 cookie 1795 pos 4000 bytes 01c8 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_callback: complete tx_status: addr:c0f14010 soc_dmaengine_pointer: index 1 stat 1 residue 3ff0 cookie 1796 pos 8000 bytes 4010 snd_pcm_update_hw_ptr0+0x30/0x24c soc_dmaengine_submit: submitting index 0 size 4000 cookie 1799 tx_status: addr:c0f14060 soc_dmaengine_pointer: index 1 stat 1 residue 3fa0 cookie 1796 pos 8000 bytes 4060 snd_pcm_update_hw_ptr0+0x30/0x24c tx_status: addr:c0f14130 soc_dmaengine_pointer: index 1 stat 1 residue 3ed0 cookie 1796 pos 8000 bytes 4130 snd_pcm_update_hw_ptr0+0x30/0x24c
On Thu, May 10, 2012 at 9:36 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
Any ideas?
I believe it's because the pcm's write tries to keep the data in ring buffer filled upto brim and keeps polling the dma position till it keeps reporting some progress.
I think if you could avoid that behavior if your driver carefully keeps negligible progresses to itself and report only considerable ones. Basically, core would stop polling and wait on queue when two pointer calls return same value i.e no progress.
On Thu, May 10, 2012 at 1:00 PM, Jassi Brar jassisinghbrar@gmail.com wrote:
On Thu, May 10, 2012 at 9:36 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
Any ideas?
I believe it's because the pcm's write tries to keep the data in ring buffer filled upto brim and keeps polling the dma position till it keeps reporting some progress.
Seems like it should sleep until avail_min frames are empty?
On Thu, May 10, 2012 at 10:30:44PM +0530, Jassi Brar wrote:
On Thu, May 10, 2012 at 9:36 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
Any ideas?
I believe it's because the pcm's write tries to keep the data in ring buffer filled upto brim and keeps polling the dma position till it keeps reporting some progress.
I think if you could avoid that behavior if your driver carefully keeps negligible progresses to itself and report only considerable ones. Basically, core would stop polling and wait on queue when two pointer calls return same value i.e no progress.
You _could_ but why should drivers work around silly behaviour in the ALSA core?
On Thu, May 10, 2012 at 10:34 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, May 10, 2012 at 10:30:44PM +0530, Jassi Brar wrote:
On Thu, May 10, 2012 at 9:36 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
The question is - what would cause ALSA to hammer away at the buffer like that - checking it every 16 or 8 DMA bytes transferred? I'm also seeing around 75% system CPU time, which really isn't good.
The above was captured while aplay was running, with no apparant audio corruption.
Any ideas?
I believe it's because the pcm's write tries to keep the data in ring buffer filled upto brim and keeps polling the dma position till it keeps reporting some progress.
I think if you could avoid that behavior if your driver carefully keeps negligible progresses to itself and report only considerable ones. Basically, core would stop polling and wait on queue when two pointer calls return same value i.e no progress.
You _could_ but why should drivers work around silly behaviour in the ALSA core?
One reason I could think of is to save us from under-runs for as long as possible.
participants (5)
-
Clemens Ladisch
-
Jassi Brar
-
Russell King - ARM Linux
-
Takashi Iwai
-
Trent Piepho