[alsa-devel] Timer instability

older
[alsa-devel] [PATCH] snd-powermac:...

Lennart Poettering

19 Feb 2009 19 Feb '09

3:46 a.m.

Heya!

As you might now PulseAudio schedules audio by system timers, not by sound card interrupts. Unfortunately there are are couple of issues with the reliability of the timing/buffering information we get from ALSA.

For example, on an ens1371 I have tracked down the following issue: I use the full buffer that the sound card provides (64K, i.e. 371ms). I sleep for about 350 ms, then fill up again, and sleep 350ms, and so and so on. Everytime I wake up I check snd_pcm_avail() and write exactly as much as this call tells me to. I also listen to POLLOUT events on the fd, however I call snd_pcm_sw_params_set_period_event(, 0) for it to make sure I actually don't get any real events on that.

After a couple of minutes of running this loop I can reliably reproduce that _avail() tells me I should fill the buffer up, I do so, query it again and it shows the buffer is filled up. I then go to sleep but immediately get POLLIN (although I shouldnt, given that I called snd_pcm_sw_params_set_period_event()), wake up again, call snd_pcm_avail() and according to it the buffer ran completely empty. The poll() took less than 1 ms time, and the last thing I did before going to sleep was checking that the buffer was full. So certainly ALSA is lying to me here... The buffer couldn't have run empty in less than 1 ms!

The same with real numbers as one example:

1. ... we are woken up 2. _avail() returns 15496 samples (i.e. 350ms to fill up) 3. we gather and write 15496 samples 4. _avail() returns 48 samples, which we consider "filled up enough" to go to sleep 5. we call poll() 6. after < 1ms we are woken up by a POLLIN (which we actually thought we had explicitly disabled) 7. _avail() returns 16448 samples (i.e. 372ms to fill up -- larger than the actual buffer of 16384 samples!)

This smells a lot like some kind of overflow to me, given that in every case where I could reproduce this it was possible to 'fix' the issue by substracting the buffer size from the _avail() after the poll(). After that substraction the value started to make sense again. (i.e. in the example above it then becomes 64 samples, which is reasonable after sleep less than 1ms I guess).

The stop threshold is set to the boundary here btw.

If necessary I could extract a test case for this, but my hope that you guys might have an idea what goes on without a test case, given that this smells so "overflowy" ;-)

And of course, how come ALSA triggers POLLIN if I asked it not to?

This is alsa-libs 1.0.19 on a 2.6.27 kernel.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Show replies by date

Takashi Iwai

19 Feb 19 Feb

7:52 a.m.

At Thu, 19 Feb 2009 03:46:11 +0100, Lennart Poettering wrote:

...

Heya!

As you might now PulseAudio schedules audio by system timers, not by sound card interrupts. Unfortunately there are are couple of issues with the reliability of the timing/buffering information we get from ALSA.

For example, on an ens1371 I have tracked down the following issue: I use the full buffer that the sound card provides (64K, i.e. 371ms). I sleep for about 350 ms, then fill up again, and sleep 350ms, and so and so on. Everytime I wake up I check snd_pcm_avail() and write exactly as much as this call tells me to. I also listen to POLLOUT events on the fd, however I call snd_pcm_sw_params_set_period_event(, 0) for it to make sure I actually don't get any real events on that.

After a couple of minutes of running this loop I can reliably reproduce that _avail() tells me I should fill the buffer up, I do so, query it again and it shows the buffer is filled up. I then go to sleep but immediately get POLLIN (although I shouldnt, given that I called snd_pcm_sw_params_set_period_event()), wake up again, call snd_pcm_avail() and according to it the buffer ran completely empty. The poll() took less than 1 ms time, and the last thing I did before going to sleep was checking that the buffer was full. So certainly ALSA is lying to me here... The buffer couldn't have run empty in less than 1 ms!

The same with real numbers as one example:

... we are woken up

_avail() returns 15496 samples (i.e. 350ms to fill up)

we gather and write 15496 samples

_avail() returns 48 samples, which we consider "filled up enough" to go to sleep

we call poll()

after < 1ms we are woken up by a POLLIN (which we actually thought we had explicitly disabled)

_avail() returns 16448 samples (i.e. 372ms to fill up -- larger than the actual buffer of 16384 samples!)

This smells a lot like some kind of overflow to me, given that in every case where I could reproduce this it was possible to 'fix' the issue by substracting the buffer size from the _avail() after the poll(). After that substraction the value started to make sense again. (i.e. in the example above it then becomes 64 samples, which is reasonable after sleep less than 1ms I guess).

The stop threshold is set to the boundary here btw.

If necessary I could extract a test case for this, but my hope that you guys might have an idea what goes on without a test case, given that this smells so "overflowy" ;-)

And of course, how come ALSA triggers POLLIN if I asked it not to?

Are you using *_period_event() stuff? Unless you use dmix, this is the only place where strange POLLIN may come in. There are many new hacks there...

Try to run without that. Also make sure that you use the very latest alsa-lib snapshot.

Takashi

Lennart Poettering

20 Feb 20 Feb

2:22 a.m.

On Thu, 19.02.09 07:52, Takashi Iwai (tiwai@suse.de) wrote:

...

At Thu, 19 Feb 2009 03:46:11 +0100, Lennart Poettering wrote:

...
Heya!

As you might now PulseAudio schedules audio by system timers, not by sound card interrupts. Unfortunately there are are couple of issues with the reliability of the timing/buffering information we get from ALSA.

For example, on an ens1371 I have tracked down the following issue: I use the full buffer that the sound card provides (64K, i.e. 371ms). I sleep for about 350 ms, then fill up again, and sleep 350ms, and so and so on. Everytime I wake up I check snd_pcm_avail() and write exactly as much as this call tells me to. I also listen to POLLOUT events on the fd, however I call snd_pcm_sw_params_set_period_event(, 0) for it to make sure I actually don't get any real events on that.

After a couple of minutes of running this loop I can reliably reproduce that _avail() tells me I should fill the buffer up, I do so, query it again and it shows the buffer is filled up. I then go to sleep but immediately get POLLIN (although I shouldnt, given that I called snd_pcm_sw_params_set_period_event()), wake up again, call snd_pcm_avail() and according to it the buffer ran completely empty. The poll() took less than 1 ms time, and the last thing I did before going to sleep was checking that the buffer was full. So certainly ALSA is lying to me here... The buffer couldn't have run empty in less than 1 ms!

The same with real numbers as one example:

... we are woken up

_avail() returns 15496 samples (i.e. 350ms to fill up)

we gather and write 15496 samples

_avail() returns 48 samples, which we consider "filled up enough" to go to sleep

we call poll()

after < 1ms we are woken up by a POLLIN (which we actually thought we had explicitly disabled)

_avail() returns 16448 samples (i.e. 372ms to fill up -- larger than the actual buffer of 16384 samples!)

This smells a lot like some kind of overflow to me, given that in every case where I could reproduce this it was possible to 'fix' the issue by substracting the buffer size from the _avail() after the poll(). After that substraction the value started to make sense again. (i.e. in the example above it then becomes 64 samples, which is reasonable after sleep less than 1ms I guess).

The stop threshold is set to the boundary here btw.

If necessary I could extract a test case for this, but my hope that you guys might have an idea what goes on without a test case, given that this smells so "overflowy" ;-)

And of course, how come ALSA triggers POLLIN if I asked it not to?

Are you using *_period_event() stuff?

Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

...

Unless you use dmix, this is the only place where strange POLLIN may come in. There are many new hacks there...

There is no plug in used but "hw".

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Lennart Poettering

2:26 a.m.

On Fri, 20.02.09 02:22, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

On Thu, 19.02.09 07:52, Takashi Iwai (tiwai@suse.de) wrote:

...
At Thu, 19 Feb 2009 03:46:11 +0100, Lennart Poettering wrote:

...
Heya!

As you might now PulseAudio schedules audio by system timers, not by sound card interrupts. Unfortunately there are are couple of issues with the reliability of the timing/buffering information we get from ALSA.

For example, on an ens1371 I have tracked down the following issue: I use the full buffer that the sound card provides (64K, i.e. 371ms). I sleep for about 350 ms, then fill up again, and sleep 350ms, and so and so on. Everytime I wake up I check snd_pcm_avail() and write exactly as much as this call tells me to. I also listen to POLLOUT events on the fd, however I call snd_pcm_sw_params_set_period_event(, 0) for it to make sure I actually don't get any real events on that.

After a couple of minutes of running this loop I can reliably reproduce that _avail() tells me I should fill the buffer up, I do so, query it again and it shows the buffer is filled up. I then go to sleep but immediately get POLLIN (although I shouldnt, given that I called snd_pcm_sw_params_set_period_event()), wake up again, call snd_pcm_avail() and according to it the buffer ran completely empty. The poll() took less than 1 ms time, and the last thing I did before going to sleep was checking that the buffer was full. So certainly ALSA is lying to me here... The buffer couldn't have run empty in less than 1 ms!

The same with real numbers as one example:

... we are woken up

_avail() returns 15496 samples (i.e. 350ms to fill up)

we gather and write 15496 samples

_avail() returns 48 samples, which we consider "filled up enough" to go to sleep

we call poll()

after < 1ms we are woken up by a POLLIN (which we actually thought we had explicitly disabled)

_avail() returns 16448 samples (i.e. 372ms to fill up -- larger than the actual buffer of 16384 samples!)

This smells a lot like some kind of overflow to me, given that in every case where I could reproduce this it was possible to 'fix' the issue by substracting the buffer size from the _avail() after the poll(). After that substraction the value started to make sense again. (i.e. in the example above it then becomes 64 samples, which is reasonable after sleep less than 1ms I guess).

The stop threshold is set to the boundary here btw.

If necessary I could extract a test case for this, but my hope that you guys might have an idea what goes on without a test case, given that this smells so "overflowy" ;-)

And of course, how come ALSA triggers POLLIN if I asked it not to?

Are you using *_period_event() stuff?

Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

Oh, and upgrading to 2.6.29-rc5 had no effect either. Problem stays.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Lennart Poettering

2:50 a.m.

On Fri, 20.02.09 02:22, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

...
...
If necessary I could extract a test case for this, but my hope that you guys might have an idea what goes on without a test case, given that this smells so "overflowy" ;-)

And of course, how come ALSA triggers POLLIN if I asked it not to?

Are you using *_period_event() stuff?

Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

Actually it is worse than just the issue mentioned above. I also can reproduce that _avail() tells me the device is almost completely full (i.e. 370ms of data), and when i then go to sleep for 350ms then afterwards _avail() will tell me that about 700ms are missing. From the context the first _avail() was just wrong.

In summary: _avail() is completely unreliable on ens1371. It sometimes returns values that are too high by once the buffer size, and sometimes too low by once the buffer size.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Takashi Iwai

8:26 a.m.

At Fri, 20 Feb 2009 02:22:20 +0100, Lennart Poettering wrote:

...

On Thu, 19.02.09 07:52, Takashi Iwai (tiwai@suse.de) wrote:

...
At Thu, 19 Feb 2009 03:46:11 +0100, Lennart Poettering wrote:

...
Heya!

As you might now PulseAudio schedules audio by system timers, not by sound card interrupts. Unfortunately there are are couple of issues with the reliability of the timing/buffering information we get from ALSA.

For example, on an ens1371 I have tracked down the following issue: I use the full buffer that the sound card provides (64K, i.e. 371ms). I sleep for about 350 ms, then fill up again, and sleep 350ms, and so and so on. Everytime I wake up I check snd_pcm_avail() and write exactly as much as this call tells me to. I also listen to POLLOUT events on the fd, however I call snd_pcm_sw_params_set_period_event(, 0) for it to make sure I actually don't get any real events on that.

After a couple of minutes of running this loop I can reliably reproduce that _avail() tells me I should fill the buffer up, I do so, query it again and it shows the buffer is filled up. I then go to sleep but immediately get POLLIN (although I shouldnt, given that I called snd_pcm_sw_params_set_period_event()), wake up again, call snd_pcm_avail() and according to it the buffer ran completely empty. The poll() took less than 1 ms time, and the last thing I did before going to sleep was checking that the buffer was full. So certainly ALSA is lying to me here... The buffer couldn't have run empty in less than 1 ms!

The same with real numbers as one example:

... we are woken up

_avail() returns 15496 samples (i.e. 350ms to fill up)

we gather and write 15496 samples

_avail() returns 48 samples, which we consider "filled up enough" to go to sleep

we call poll()

after < 1ms we are woken up by a POLLIN (which we actually thought we had explicitly disabled)

_avail() returns 16448 samples (i.e. 372ms to fill up -- larger than the actual buffer of 16384 samples!)

This smells a lot like some kind of overflow to me, given that in every case where I could reproduce this it was possible to 'fix' the issue by substracting the buffer size from the _avail() after the poll(). After that substraction the value started to make sense again. (i.e. in the example above it then becomes 64 samples, which is reasonable after sleep less than 1ms I guess).

The stop threshold is set to the boundary here btw.

If necessary I could extract a test case for this, but my hope that you guys might have an idea what goes on without a test case, given that this smells so "overflowy" ;-)

And of course, how come ALSA triggers POLLIN if I asked it not to?

Are you using *_period_event() stuff?

Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

The ens1371 driver itself is damn simple. The pointer callback just returns the read value. So, it implies that it's basically a hardware issue.

Does the patch below have any influence?

...

...
Unless you use dmix, this is the only place where strange POLLIN may come in. There are many new hacks there...

There is no plug in used but "hw".

Yes but *_period_event() is implemented in "hw".

thanks,

Takashi

--- diff --git a/sound/pci/ens1370.c b/sound/pci/ens1370.c index 18f4d1e..3a7739b 100644 --- a/sound/pci/ens1370.c +++ b/sound/pci/ens1370.c @@ -991,15 +991,36 @@ static int snd_ensoniq_capture_prepare(struct snd_pcm_substream *substream) return 0; }

+#define MAX_PTR_TIMEOUT 50 + +static unsigned int get_cur_ptr(struct ensoniq *ensoniq, unsigned int map, + unsigned int reg) +{ + unsigned int optr, ptr; + int t; + + outl(map, ES_REG(ensoniq, MEM_PAGE)); + inl(ES_REG(ensoniq, MEM_PAGE)); + optr = inl(reg); + for (t = 0; t < MAX_PTR_TIMEOUT; t++) { + ptr = inl(reg); + if (ptr == optr) + return ptr; + optr = ptr; + } + return ptr; +} + static snd_pcm_uframes_t snd_ensoniq_playback1_pointer(struct snd_pcm_substream *substream) { struct ensoniq *ensoniq = snd_pcm_substream_chip(substream); size_t ptr;

spin_lock(&ensoniq->reg_lock); - if (inl(ES_REG(ensoniq, CONTROL)) & ES_DAC1_EN) { - outl(ES_MEM_PAGEO(ES_PAGE_DAC), ES_REG(ensoniq, MEM_PAGE)); - ptr = ES_REG_FCURR_COUNTI(inl(ES_REG(ensoniq, DAC1_SIZE))); + if (ensoniq->ctrl & ES_DAC1_EN) { + ptr = get_cur_ptr(ensoniq, ES_MEM_PAGEO(ES_PAGE_DAC), + ES_REG(ensoniq, DAC1_SIZE)); + ptr = ES_REG_FCURR_COUNTI(ptr); ptr = bytes_to_frames(substream->runtime, ptr); } else { ptr = 0; @@ -1014,9 +1035,10 @@ static snd_pcm_uframes_t snd_ensoniq_playback2_pointer(struct snd_pcm_substream size_t ptr;

spin_lock(&ensoniq->reg_lock); - if (inl(ES_REG(ensoniq, CONTROL)) & ES_DAC2_EN) { - outl(ES_MEM_PAGEO(ES_PAGE_DAC), ES_REG(ensoniq, MEM_PAGE)); - ptr = ES_REG_FCURR_COUNTI(inl(ES_REG(ensoniq, DAC2_SIZE))); + if (ensoniq->ctrl & ES_DAC2_EN) { + ptr = get_cur_ptr(ensoniq, ES_MEM_PAGEO(ES_PAGE_DAC), + ES_REG(ensoniq, DAC2_SIZE)); + ptr = ES_REG_FCURR_COUNTI(ptr); ptr = bytes_to_frames(substream->runtime, ptr); } else { ptr = 0; @@ -1031,9 +1053,10 @@ static snd_pcm_uframes_t snd_ensoniq_capture_pointer(struct snd_pcm_substream *s size_t ptr;

spin_lock(&ensoniq->reg_lock); - if (inl(ES_REG(ensoniq, CONTROL)) & ES_ADC_EN) { - outl(ES_MEM_PAGEO(ES_PAGE_ADC), ES_REG(ensoniq, MEM_PAGE)); - ptr = ES_REG_FCURR_COUNTI(inl(ES_REG(ensoniq, ADC_SIZE))); + if (ensoniq->ctrl & ES_ADC_EN) { + ptr = get_cur_ptr(ensoniq, ES_MEM_PAGEO(ES_PAGE_ADC), + ES_REG(ensoniq, ADC_SIZE)); + ptr = ES_REG_FCURR_COUNTI(ptr); ptr = bytes_to_frames(substream->runtime, ptr); } else { ptr = 0;

Lennart Poettering

9:34 p.m.

On Fri, 20.02.09 08:26, Takashi Iwai (tiwai@suse.de) wrote:

...

...
Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

The ens1371 driver itself is damn simple. The pointer callback just returns the read value. So, it implies that it's basically a hardware issue.

Does the patch below have any influence?

No. It doesn't appear to have any effect whatsoever. Still, from time to time the value returned by _avail() is off by once the buffer size, sometimes upwards, sometimes downwards.

Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Takashi Iwai

21 Feb 21 Feb

5:36 p.m.

At Fri, 20 Feb 2009 21:34:17 +0100, Lennart Poettering wrote:

...

On Fri, 20.02.09 08:26, Takashi Iwai (tiwai@suse.de) wrote:

...
...
Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

The ens1371 driver itself is damn simple. The pointer callback just returns the read value. So, it implies that it's basically a hardware issue.

Does the patch below have any influence?

No. It doesn't appear to have any effect whatsoever. Still, from time to time the value returned by _avail() is off by once the buffer size, sometimes upwards, sometimes downwards.

Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Hrm... both cases the pointer calculation is relatively simple, and for "hw" PCM, the avail calculation is also very straightforward. So, my first suspect is about the pointer calculation. But, of course we should check all possibilities.

Can you give a small testcase?

thanks,

Takashi

Lennart Poettering

22 Feb 22 Feb

4:14 a.m.

On Sat, 21.02.09 17:36, Takashi Iwai (tiwai@suse.de) wrote:

...

At Fri, 20 Feb 2009 21:34:17 +0100, Lennart Poettering wrote:

...
On Fri, 20.02.09 08:26, Takashi Iwai (tiwai@suse.de) wrote:

...
...
Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

The ens1371 driver itself is damn simple. The pointer callback just returns the read value. So, it implies that it's basically a hardware issue.

Does the patch below have any influence?

No. It doesn't appear to have any effect whatsoever. Still, from time to time the value returned by _avail() is off by once the buffer size, sometimes upwards, sometimes downwards.

Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Hrm... both cases the pointer calculation is relatively simple, and for "hw" PCM, the avail calculation is also very straightforward. So, my first suspect is about the pointer calculation. But, of course we should check all possibilities.

Can you give a small testcase?

Ok, here's a little test case:

http://git.0pointer.de/?p=pulseaudio.git;a=blob_plain;f=src/tests/alsa-time-...

I wrote this little test to track down all kinds of timer issues: it opens and sets up a sound device. Then it will constantly query _avail(), _delay() and _htstamp() in a busy loop writing a single sample at a time. The three values are then printed along with a system timestamp. The data generated like this can be opened in gnumeric and a nice graph be drawn.

So much about the original purpose of this tool.

The issue in question can be triggered by simply running this tool and passing the device name as argv[0]. i.e. "alsa-time-test hw:AudioPCI".

The assert at the end of the file (line 189) is triggered when _avail() starts to become unreliable and jumps around.

For me to reproduce this issue all I have to do is run:

$ alsa-time-test hw:AudioPCI > log

and then wait for a minute or two. The assert is hit the tool aborted.

When that happens one can check the end of 'log' and will see something like this:

The first column is clock_gettime(CLOCK_MONOTONIC) in us. Second column in the timestamp stored in the snd_pcm_state_t. The third is the calculated playback time (i.e. write count - delay). The fourth column is avail, the fifth is delay.

In the last two lines (i.e. right when the assert aborted us) it becomes interesting. Suddenly in 30 us, the buffer jumped from an avail of 3885 to 6132 (buffer size is 4410).

I debugged a little bit through alsa-lib. Apparently in contrast to what I guessed before this issue seems indeed to be caused by a kernel bug: the hw_ptr right after the SNDRV_PCM_IOCTL_HWSYNC is already bogus.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Lennart Poettering

23 Feb 23 Feb

3:42 a.m.

On Sun, 22.02.09 04:14, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

I wrote this little test to track down all kinds of timer issues: it opens and sets up a sound device. Then it will constantly query _avail(), _delay() and _htstamp() in a busy loop writing a single sample at a time. The three values are then printed along with a system timestamp. The data generated like this can be opened in gnumeric and a nice graph be drawn.

Hmm, did some more testing with this tool on other cards:

On es1969 snd_pcm_avail() sometimes returns values like 1073728596 samples. This smells a lot like an overflow given that this times four (i.e. the sample size in bytes) is near to 2^32.

Here's an output of this tool for an emu10k1 card:

http://www.nopaste.pl/9vi

Check out line 4417: This is when the devcie starts playback. avail goes rightfully to 0 (i.e column 4 -- don't get confused by the line numbers the nopaste adds here), but the playback time suddenly jumps from 100113 us down to 702 us (column 4 which is the written sample counter minus the delay transformed to us). This is because in contrast to most other drivers for this one writing a sample has no direct effect on snd_pcm_delay while the device is stopped.

Also this shows that the data from _avail() and _status() together is not atomic: the columns for avail delay change in different rows when we don't get scheduled often enough.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Lennart Poettering

3:56 a.m.

On Mon, 23.02.09 03:42, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

On Sun, 22.02.09 04:14, Lennart Poettering (mznyfn@0pointer.de) wrote:

...
I wrote this little test to track down all kinds of timer issues: it opens and sets up a sound device. Then it will constantly query _avail(), _delay() and _htstamp() in a busy loop writing a single sample at a time. The three values are then printed along with a system timestamp. The data generated like this can be opened in gnumeric and a nice graph be drawn.

Hmm, did some more testing with this tool on other cards:

On es1969 snd_pcm_avail() sometimes returns values like 1073728596 samples. This smells a lot like an overflow given that this times four (i.e. the sample size in bytes) is near to 2^32.

[...]

...

Here's an output of this tool for an emu10k1 card:

[...]

This is the end of the tool's output for HDA AD1989B:

http://fpaste.org/paste/4240

The device is completely filled up (i.e. avail is 0, delay is 4416) and then after 2us suddenly the avail jumps to 4971973988617027465 and the delay to -4971973988617023049.

Smells like an integer overflow to me ...

Hmm, for now my little testing experiment I've now found es1969, ens1371, intel8x0, snd-emu10k1 and some intel-hda to have unreliable _avail() or _delay().

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Lennart Poettering

8:20 p.m.

On Mon, 23.02.09 03:56, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

Hmm, for now my little testing experiment I've now found es1969, ens1371, intel8x0, snd-emu10k1 and some intel-hda to have unreliable _avail() or _delay().

And here's the output for HDA STAC92xx:

http://pastebin.mandriva.com/7149

The first dump is the interesting one: when audio starts to play the buffer abruptly runs empty in just 20us after the device started. (The rightmost column is the state btw, 3 == playing).

A quick overview of cards where _delay() and/or _avail() are unreliable:

es1969 ens1371 intel8x0 emu10k1 intel-hda on STAC92xx intel-hda on AD1989B

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Colin Guthrie

8:24 p.m.

'Twas brillig, and Lennart Poettering at 23/02/09 19:20 did gyre and gimble:

...

On Mon, 23.02.09 03:56, Lennart Poettering (mznyfn@0pointer.de) wrote:

...
Hmm, for now my little testing experiment I've now found es1969, ens1371, intel8x0, snd-emu10k1 and some intel-hda to have unreliable _avail() or _delay().

And here's the output for HDA STAC92xx:

http://pastebin.mandriva.com/7149

I realised that post will timeout after a month, so here is a longer term one :)

http://pastebin.mandriva.com/7238

...

The first dump is the interesting one: when audio starts to play the buffer abruptly runs empty in just 20us after the device started. (The rightmost column is the state btw, 3 == playing).

A quick overview of cards where _delay() and/or _avail() are unreliable:

es1969 ens1371 intel8x0 emu10k1 intel-hda on STAC92xx intel-hda on AD1989B

Lennart

-- Colin Guthrie gmane(at)colin.guthr.ie http://colin.guthr.ie/ Day Job: Tribalogic Limited [http://www.tribalogic.net/] Open Source: Mandriva Linux Contributor [http://www.mandriva.com/] PulseAudio Hacker [http://www.pulseaudio.org/] Trac Hacker [http://trac.edgewall.org/]

Lennart Poettering

24 Feb 24 Feb

4:21 a.m.

On Mon, 23.02.09 20:20, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

On Mon, 23.02.09 03:56, Lennart Poettering (mznyfn@0pointer.de) wrote:

...
Hmm, for now my little testing experiment I've now found es1969, ens1371, intel8x0, snd-emu10k1 and some intel-hda to have unreliable _avail() or _delay().

And here's the output for HDA STAC92xx:

http://pastebin.mandriva.com/7149

The first dump is the interesting one: when audio starts to play the buffer abruptly runs empty in just 20us after the device started. (The rightmost column is the state btw, 3 == playing).

A quick overview of cards where _delay() and/or _avail() are unreliable:

es1969 ens1371 intel8x0 emu10k1 intel-hda on STAC92xx intel-hda on AD1989B

And here's another one, for intel-hda on ALC883:

Sometimes snd_pcm_avail() returns values like 4611686018427387098.

Smells like another overflow to me.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Takashi Iwai

23 Feb 23 Feb

8:47 a.m.

At Sun, 22 Feb 2009 04:14:53 +0100, Lennart Poettering wrote:

...

On Sat, 21.02.09 17:36, Takashi Iwai (tiwai@suse.de) wrote:

...
At Fri, 20 Feb 2009 21:34:17 +0100, Lennart Poettering wrote:

...
On Fri, 20.02.09 08:26, Takashi Iwai (tiwai@suse.de) wrote:

...
...
Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

The ens1371 driver itself is damn simple. The pointer callback just returns the read value. So, it implies that it's basically a hardware issue.

Does the patch below have any influence?

No. It doesn't appear to have any effect whatsoever. Still, from time to time the value returned by _avail() is off by once the buffer size, sometimes upwards, sometimes downwards.

Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Hrm... both cases the pointer calculation is relatively simple, and for "hw" PCM, the avail calculation is also very straightforward. So, my first suspect is about the pointer calculation. But, of course we should check all possibilities.

Can you give a small testcase?

Ok, here's a little test case:

http://git.0pointer.de/?p=pulseaudio.git;a=blob_plain;f=src/tests/alsa-time-...

Thanks. Will check it later.

Takashi

Takashi Iwai

24 Feb 24 Feb

5:27 p.m.

At Sun, 22 Feb 2009 04:14:53 +0100, Lennart Poettering wrote:

...

On Sat, 21.02.09 17:36, Takashi Iwai (tiwai@suse.de) wrote:

...
At Fri, 20 Feb 2009 21:34:17 +0100, Lennart Poettering wrote:

...
On Fri, 20.02.09 08:26, Takashi Iwai (tiwai@suse.de) wrote:

...
...
Yes, I am. But ignore that part for now. I have now commented the use of that call. Now I certainly get more POLLOUTs as expected, but the real problem stays: after a few minutes _avail() will suddenly jump from next to zero to more then the hwbuf size in less than 1ms without any further inteference and with a buffer size of 350ms! There is something really wrong with the behaviour of _avail().

I can reproduce this only on ens1371 for now.

The ens1371 driver itself is damn simple. The pointer callback just returns the read value. So, it implies that it's basically a hardware issue.

Does the patch below have any influence?

No. It doesn't appear to have any effect whatsoever. Still, from time to time the value returned by _avail() is off by once the buffer size, sometimes upwards, sometimes downwards.

Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Hrm... both cases the pointer calculation is relatively simple, and for "hw" PCM, the avail calculation is also very straightforward. So, my first suspect is about the pointer calculation. But, of course we should check all possibilities.

Can you give a small testcase?

Ok, here's a little test case:

http://git.0pointer.de/?p=pulseaudio.git;a=blob_plain;f=src/tests/alsa-time-...

Finally have time to go back to this issue again now.

With your test case, indeed I get abort(), too. However, checking through the code, you set stop_threshold to the boundary size like:

r = snd_pcm_sw_params_get_boundary(swparams, &boundary); r = snd_pcm_sw_params_set_stop_threshold(pcm, swparams, boundary);

This means essentially you are disabling the XRUN detection in the driver and keep it free-running. So, no wonder avail gets over buffer_size. It's actually an XRUN condition, but just ignored due to this setup. (Usually, stop_threshould == buffer_size.)

Or do I misunderstand the intention of your testcase?

thanks,

Takashi

Lennart Poettering

7:46 p.m.

On Tue, 24.02.09 17:27, Takashi Iwai (tiwai@suse.de) wrote:

...

...
...
...
Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Hrm... both cases the pointer calculation is relatively simple, and for "hw" PCM, the avail calculation is also very straightforward. So, my first suspect is about the pointer calculation. But, of course we should check all possibilities.

Can you give a small testcase?

Ok, here's a little test case:

http://git.0pointer.de/?p=pulseaudio.git;a=blob_plain;f=src/tests/alsa-time-...

Finally have time to go back to this issue again now.

With your test case, indeed I get abort(), too. However, checking through the code, you set stop_threshold to the boundary size like:
r = snd_pcm_sw_params_get_boundary(swparams, &boundary);
r = snd_pcm_sw_params_set_stop_threshold(pcm, swparams, boundary);
This means essentially you are disabling the XRUN detection in the driver and keep it free-running. So, no wonder avail gets over buffer_size. It's actually an XRUN condition, but just ignored due to this setup. (Usually, stop_threshould == buffer_size.)

Yes, threshold is set to boundary. But that's intended. Remember that the original purpose of the tool is to graph _avail() and _delay() against the time. For that I want to make sure the that timing stays stable no matter what.

In reality this setting should not matter at all, because most machines should be fast enough to keep the buffer filled even if we write one sample at a time as it is done in the example.

The problems exposed in all the test data posted here for the various sound cards are not normal normal buffer underruns.

In summary: just ignore the setting of the stop threshold. It is not related to the problems exposed here.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Takashi Iwai

7:59 p.m.

At Tue, 24 Feb 2009 19:46:05 +0100, Lennart Poettering wrote:

...

On Tue, 24.02.09 17:27, Takashi Iwai (tiwai@suse.de) wrote:

...
...
...
...
Moreover I can now reproduce the same issue on snd-intel8x0, albeit it takes more than a couple of minutes to make snd_pcm_avail() return bogus data.

Somehow I get the feeling the problem is not so much the inaccuracy of the pointer the kernel reports but in what alsa-libs does ith it.

Hrm... both cases the pointer calculation is relatively simple, and for "hw" PCM, the avail calculation is also very straightforward. So, my first suspect is about the pointer calculation. But, of course we should check all possibilities.

Can you give a small testcase?

Ok, here's a little test case:

http://git.0pointer.de/?p=pulseaudio.git;a=blob_plain;f=src/tests/alsa-time-...

Finally have time to go back to this issue again now.

With your test case, indeed I get abort(), too. However, checking through the code, you set stop_threshold to the boundary size like:
r = snd_pcm_sw_params_get_boundary(swparams, &boundary);
r = snd_pcm_sw_params_set_stop_threshold(pcm, swparams, boundary);
This means essentially you are disabling the XRUN detection in the driver and keep it free-running. So, no wonder avail gets over buffer_size. It's actually an XRUN condition, but just ignored due to this setup. (Usually, stop_threshould == buffer_size.)
Yes, threshold is set to boundary. But that's intended. Remember that the original purpose of the tool is to graph _avail() and _delay() against the time. For that I want to make sure the that timing stays stable no matter what.

In reality this setting should not matter at all, because most machines should be fast enough to keep the buffer filled even if we write one sample at a time as it is done in the example.

Not really. If you feed the output to a terminal, it's damn slow. That's why I got XRUN.

...

The problems exposed in all the test data posted here for the various sound cards are not normal normal buffer underruns.

Well, basically "avail >= buffer_size" means the underrun per definition. The state isn't changed and kept as RUNNING just because of the stop_threshold value.

...

In summary: just ignore the setting of the stop threshold. It is not related to the problems exposed here.

FWIW, when I run the program and feed the output to /dev/null, I don't get abort() after minutes. It happened soon only when it runs on a terminal. So, the condition appears very likely as an XRUN.

So, try to set stop_threshold to buffer_size once, and check whether you get any different result. Let's check another possibility if it really isn't the case.

Takashi

Jaroslav Kysela

8:04 p.m.

On Tue, 24 Feb 2009, Takashi Iwai wrote:

...

...
In reality this setting should not matter at all, because most machines should be fast enough to keep the buffer filled even if we write one sample at a time as it is done in the example.

Not really. If you feed the output to a terminal, it's damn slow. That's why I got XRUN.

I figured this few seconds ago, too. It's the terminal output which makes XRUN.

Jaroslav

----- Jaroslav Kysela perex@perex.cz Linux Kernel Sound Maintainer ALSA Project, Red Hat, Inc.

Lennart Poettering

8:26 p.m.

On Tue, 24.02.09 19:59, Takashi Iwai (tiwai@suse.de) wrote:

...

...
Yes, threshold is set to boundary. But that's intended. Remember that the original purpose of the tool is to graph _avail() and _delay() against the time. For that I want to make sure the that timing stays stable no matter what.

In reality this setting should not matter at all, because most machines should be fast enough to keep the buffer filled even if we write one sample at a time as it is done in the example.

Not really. If you feed the output to a terminal, it's damn slow. That's why I got XRUN.

Oh, of course, I should have mentioned that.

The tool generates a *lot* of data, given that it queries _avail() and _delay() in a busy loop.

To run it either redirect the output to disk (which of course might create gaps in the output if we write faster than the disk can take it but this is easily detected by looking on the first column). Or pipe it directly to something like "tail -n 50".

...

...
In summary: just ignore the setting of the stop threshold. It is not related to the problems exposed here.

FWIW, when I run the program and feed the output to /dev/null, I don't get abort() after minutes. It happened soon only when it runs on a terminal. So, the condition appears very likely as an XRUN.

The outputs I posted earlier are *not* due to normal underruns. Please have a look at the actual data!

...

So, try to set stop_threshold to buffer_size once, and check whether you get any different result. Let's check another possibility if it really isn't the case.

In those test cases no normal XRUN ever happens unless you pipe the output to a terminal: please, just actaully look at the data generated!

The effective difference for this tool that it makes to set stop_threshold to boundary or not is simply this:

if stop_threshold is set to boundary underruns will be caught by the mentioned assert.

if stop_threshold is not set to boundary underruns result in an EPIPE which is then caught by a different assert in the code.

So again: how stop_threshold is set is *irrelevant* for the test case! It simply would cause a different assert() to be triggered! That is all. And again, unless you machine is very slow or otherwise busy _avail() should never come near the buffer size except very early afetr startup.

stop_threshold is completely irrelevant to the problems discussed here.

Oh, and one thing I didn't actually notice earlier: Most drivers return a negative snd_pcm_delay() if a real underrun happens. According to the definition of snd_pcm_delay() that we agreed on a couple of months ago and that is now docuemented in doxygen this makes no sense. The definition of snd_pcm_delay() goes like this:

"For playback the delay is defined as the time that a frame that is written to the PCM stream shortly after this call will take to be actually audible. It is as such the overall latency from the write call to the final DAC." (from the doxygen docs)

I.e. because on the this world it is impossible to hear a sample that hasn't even been written yet, _delay() should only return positive values. However many drivers do return negative values on underrun.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Takashi Iwai

9:37 p.m.

At Tue, 24 Feb 2009 20:26:11 +0100, Lennart Poettering wrote:

...

On Tue, 24.02.09 19:59, Takashi Iwai (tiwai@suse.de) wrote:

...
...
Yes, threshold is set to boundary. But that's intended. Remember that the original purpose of the tool is to graph _avail() and _delay() against the time. For that I want to make sure the that timing stays stable no matter what.

In reality this setting should not matter at all, because most machines should be fast enough to keep the buffer filled even if we write one sample at a time as it is done in the example.

Not really. If you feed the output to a terminal, it's damn slow. That's why I got XRUN.

Oh, of course, I should have mentioned that.

The tool generates a *lot* of data, given that it queries _avail() and _delay() in a busy loop.

To run it either redirect the output to disk (which of course might create gaps in the output if we write faster than the disk can take it but this is easily detected by looking on the first column). Or pipe it directly to something like "tail -n 50".

Damn, you should have mentioned it before. I lost too long time just for that!

...

...
...
In summary: just ignore the setting of the stop threshold. It is not related to the problems exposed here.

FWIW, when I run the program and feed the output to /dev/null, I don't get abort() after minutes. It happened soon only when it runs on a terminal. So, the condition appears very likely as an XRUN.

The outputs I posted earlier are *not* due to normal underruns. Please have a look at the actual data!

I didn't get any "actual data" because of xrun.

...

...
So, try to set stop_threshold to buffer_size once, and check whether you get any different result. Let's check another possibility if it really isn't the case.

In those test cases no normal XRUN ever happens unless you pipe the output to a terminal: please, just actaully look at the data generated!

The data isn't generated if XRUN occurs because it stops with assert() at that point.

...

The effective difference for this tool that it makes to set stop_threshold to boundary or not is simply this:
if stop_threshold is set to boundary underruns will be caught by the
mentioned assert.

if stop_threshold is not set to boundary underruns result in an EPIPE
which is then caught by a different assert in the code.
So again: how stop_threshold is set is *irrelevant* for the test case!

But the assert caught exaclty the XRUN. Not *always* irrelevant.

...

It simply would cause a different assert() to be triggered! That is all. And again, unless you machine is very slow or otherwise busy _avail() should never come near the buffer size except very early afetr startup.

It came. Maybe it's slow.

...

stop_threshold is completely irrelevant to the problems discussed here.

OK, but let me continue that tomorrow or later.

...

Oh, and one thing I didn't actually notice earlier: Most drivers return a negative snd_pcm_delay() if a real underrun happens. According to the definition of snd_pcm_delay() that we agreed on a couple of months ago and that is now docuemented in doxygen this makes no sense. The definition of snd_pcm_delay() goes like this:

"For playback the delay is defined as the time that a frame that is written to the PCM stream shortly after this call will take to be actually audible. It is as such the overall latency from the write call to the final DAC." (from the doxygen docs)

I.e. because on the this world it is impossible to hear a sample that hasn't even been written yet, _delay() should only return positive values. However many drivers do return negative values on underrun.

Maybe you are walking faster than the light speed.

I don't mind to add a negative check in the delay value, but who would actually care? It's underrun, so the behavior of the hardware isn't really defined...

Takashi

Takashi Iwai

25 Feb 25 Feb

11:08 a.m.

At Tue, 24 Feb 2009 21:37:20 +0100, I wrote:

...

...
stop_threshold is completely irrelevant to the problems discussed here.

OK, but let me continue that tomorrow or later.

Running it with five sound cards (two HD-audio, emu10k1, CMI8738, and envy24HT) overnight with tail -500, but nothing appears on my machines.

All outputs look OK: the first three columns are almost same values, avail is a few (< 10), delay is about the buffer size.

How can you trigger the problem so easily?

Takashi

Takashi Iwai

11:22 a.m.

At Wed, 25 Feb 2009 11:08:50 +0100, I wrote:

...

At Tue, 24 Feb 2009 21:37:20 +0100, I wrote:

...
...
stop_threshold is completely irrelevant to the problems discussed here.

OK, but let me continue that tomorrow or later.

Running it with five sound cards (two HD-audio, emu10k1, CMI8738, and envy24HT) overnight with tail -500, but nothing appears on my machines.

All outputs look OK: the first three columns are almost same values, avail is a few (< 10), delay is about the buffer size.

How can you trigger the problem so easily?

Could you check the raw hwptr and applptr values surrounding the error time, e.g. with the patch to alsa-lib below? If it's really hwsync that makes values unstable, we have to see the value skips there...

Takashi

--- diff --git a/src/pcm/pcm_hw.c b/src/pcm/pcm_hw.c index e9ce092..6dd1633 100644 --- a/src/pcm/pcm_hw.c +++ b/src/pcm/pcm_hw.c @@ -545,11 +545,20 @@ static int snd_pcm_hw_hwsync(snd_pcm_t *pcm) if (err < 0) return err; } else { + long old_hwptr, old_applptr; + long new_hwptr, new_applptr; + old_hwptr = *pcm->hw.ptr; + old_applptr = *pcm->appl.ptr; if (ioctl(fd, SNDRV_PCM_IOCTL_HWSYNC) < 0) { err = -errno; SYSMSG("SNDRV_PCM_IOCTL_HWSYNC failed"); return err; } + new_hwptr = *pcm->hw.ptr; + new_applptr = *pcm->appl.ptr; + printf("hwsync: %ld(%ld), %ld(%ld)\n", + new_hwptr, new_hwptr - old_hwptr, + new_applptr, new_applptr - old_applptr); } } else { snd_pcm_sframes_t delay;

Clive Messer

1:34 p.m.

On Wednesday 25 Feb 2009 10:22:26 Takashi Iwai wrote:

...

Could you check the raw hwptr and applptr values surrounding the error time, e.g. with the patch to alsa-lib below? If it's really hwsync that makes values unstable, we have to see the value skips there...

Does this help? Log attached. (head -n 100, tail -n 100)

Hardware is .....

00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller

module: snd_hda_intel pcm: AD198x Analog kernel: kernel-2.6.29-0.43.rc6.fc10.x86_64 alsa: alsa-lib-1.0.19-2.fc10.x86_64 (plus printf patch)

Regards

Clive

-- Clive Messer clive@vacuumtube.org.uk

Takashi Iwai

2:36 p.m.

At Wed, 25 Feb 2009 12:34:37 +0000, Clive Messer wrote:

...

On Wednesday 25 Feb 2009 10:22:26 Takashi Iwai wrote:

...
Could you check the raw hwptr and applptr values surrounding the error time, e.g. with the patch to alsa-lib below? If it's really hwsync that makes values unstable, we have to see the value skips there...

Does this help? Log attached. (head -n 100, tail -n 100)

Thanks. At least it shows the values in hwsync ioctl are sane. So something wrong in elsewhere.

Could you check whether this error happens with sync_ptr_ioctl flag in PCM config? For example, define the below in ~/.asoundrc,

pcm.mytest { type hw card 0 sync_ptr_ioctl true }

Then run with PCM "mytest".

Takashi

Clive Messer

4:11 p.m.

On Wednesday 25 Feb 2009 13:36:23 Takashi Iwai wrote:

...

Could you check whether this error happens with sync_ptr_ioctl flag in PCM config? For example, define the below in ~/.asoundrc,

pcm.mytest { type hw card 0 sync_ptr_ioctl true }

Then run with PCM "mytest".

./alsa-time-test mytest > pcm_mytest.log

head & tail 50 lines - attached.

Regards

Clive

-- Clive Messer clive@vacuumtube.org.uk

Takashi Iwai

4:51 p.m.

At Wed, 25 Feb 2009 15:11:44 +0000, Clive Messer wrote:

...

36192 36191 100068 3 0 0 1 2 36195 36194 100090 2 0 0 1 2 36198 36198 100113 1 0 0 1 2 36207 36206 63854 2816 1600 0 1 3 36212 36211 63854 2815 1601 4 1 3 36216 36216 63854 2814 1602 4 1 3 36221 36220 63854 2813 1603 4 1 3 36225 36225 101224 4460 -44 4 1 3

This is pretty weird. Is this behavior always reproducible? What shows with the patch below to alsa-lib?

Also, any change if you play with position_fix parameter, such as, position_fix=1?

Takashi

--- diff --git a/src/pcm/pcm_hw.c b/src/pcm/pcm_hw.c index e9ce092..9b91a13 100644 --- a/src/pcm/pcm_hw.c +++ b/src/pcm/pcm_hw.c @@ -539,6 +539,10 @@ static int snd_pcm_hw_hwsync(snd_pcm_t *pcm) { snd_pcm_hw_t *hw = pcm->private_data; int fd = hw->fd, err; + long old_hwptr, new_hwptr; + long old_applptr, new_applptr; + old_hwptr = *pcm->hw.ptr; + old_applptr = *pcm->appl.ptr; if (SNDRV_PROTOCOL_VERSION(2, 0, 3) <= hw->version) { if (hw->sync_ptr) { err = sync_ptr1(hw, SNDRV_PCM_SYNC_PTR_HWSYNC); @@ -564,6 +568,11 @@ static int snd_pcm_hw_hwsync(snd_pcm_t *pcm) } } } + new_hwptr = *pcm->hw.ptr; + new_applptr = *pcm->appl.ptr; + printf("hwsync: %ld(%ld), %ld(%ld)\n", + new_hwptr, new_hwptr - old_hwptr, + new_applptr, new_applptr - old_applptr); return 0; }

Clive Messer

5:24 p.m.

On Wednesday 25 Feb 2009 15:51:33 you wrote:

...

At Wed, 25 Feb 2009 15:11:44 +0000,

Clive Messer wrote:

...
36192 36191 100068 3 0 0 1 2 36195 36194 100090 2 0 0 1 2 36198 36198 100113 1 0 0 1 2 36207 36206 63854 2816 1600 0 1 3 36212 36211 63854 2815 1601 4 1 3 36216 36216 63854 2814 1602 4 1 3 36221 36220 63854 2813 1603 4 1 3 36225 36225 101224 4460 -44 4 1 3

This is pretty weird. Is this behavior always reproducible?

Nope. I get different results every time I run it.

...

What shows with the patch below to alsa-lib?

Here's one (attached) I just did with the new patch that shows the crazy high numbers again.

...

Also, any change if you play with position_fix parameter, such as, position_fix=1?

I'll try that module option next.

Regards

Clive

-- Clive Messer clive@vacuumtube.org.uk

Takashi Iwai

5:39 p.m.

At Wed, 25 Feb 2009 16:24:18 +0000, Clive Messer wrote:

...

On Wednesday 25 Feb 2009 15:51:33 you wrote:

...
At Wed, 25 Feb 2009 15:11:44 +0000,

Clive Messer wrote:

...
36192 36191 100068 3 0 0 1 2 36195 36194 100090 2 0 0 1 2 36198 36198 100113 1 0 0 1 2 36207 36206 63854 2816 1600 0 1 3 36212 36211 63854 2815 1601 4 1 3 36216 36216 63854 2814 1602 4 1 3 36221 36220 63854 2813 1603 4 1 3 36225 36225 101224 4460 -44 4 1 3

This is pretty weird. Is this behavior always reproducible?

Nope. I get different results every time I run it.

...
What shows with the patch below to alsa-lib?

Here's one (attached) I just did with the new patch that shows the crazy high numbers again. hwsync: 2080(0), 6496(0) 70758 70758 47142 0 4416 0 0 3 hwsync: 1472(0), 6496(0) 70763 70762 104573379136040 4971973988617026976 -4971973988617022560 4 1 3

The problem is that hwptr goes backward here. But it didn't happen in this hwsync call but sometime before that.

Could you try the patch below instead? This will track all sync_ptr calls.

Takashi

--- diff --git a/src/pcm/pcm_hw.c b/src/pcm/pcm_hw.c index e9ce092..ad8d4a5 100644 --- a/src/pcm/pcm_hw.c +++ b/src/pcm/pcm_hw.c @@ -131,6 +131,10 @@ struct timespec snd_pcm_hw_fast_tstamp(snd_pcm_t *pcm) static int sync_ptr1(snd_pcm_hw_t *hw, unsigned int flags) { int err; + long old_hwptr, new_hwptr; + long old_applptr, new_applptr; + old_hwptr = *pcm->hw.ptr; + old_applptr = *pcm->appl.ptr; hw->sync_ptr->flags = flags; err = ioctl((hw)->fd, SNDRV_PCM_IOCTL_SYNC_PTR, (hw)->sync_ptr); if (err < 0) { @@ -138,6 +142,11 @@ static int sync_ptr1(snd_pcm_hw_t *hw, unsigned int flags) SYSMSG("SNDRV_PCM_IOCTL_SYNC_PTR failed"); return err; } + new_hwptr = *pcm->hw.ptr; + new_applptr = *pcm->appl.ptr; + printf("sync_ptr1: %ld(%ld), %ld(%ld)\n", + new_hwptr, new_hwptr - old_hwptr, + new_applptr, new_applptr - old_applptr); return 0; }

Takashi Iwai

4:04 p.m.

Wait a minute...

At Wed, 25 Feb 2009 12:34:37 +0000, Clive Messer wrote:

...

hwsync: 1108800(0), 1113216(0) 25178947 25178947 25142834 0 4416 0 0 3 hwsync: 1108800(0), 1113216(0) 25693169 25693168 25657142 21697 -18264 0 1 3

The time leaps indeed here about 0.5 sec. Assuming this time-keeping is correct, the avail and delay values there are acceptable.

In this case, apparently a buffer underrun occurs due to a sudden system load or so. It's no bug.

Takashi

Lennart Poettering

11:44 a.m.

On Wed, 25.02.09 11:08, Takashi Iwai (tiwai@suse.de) wrote:

...

At Tue, 24 Feb 2009 21:37:20 +0100, I wrote:

...
...
stop_threshold is completely irrelevant to the problems discussed here.

OK, but let me continue that tomorrow or later.

Running it with five sound cards (two HD-audio, emu10k1, CMI8738, and envy24HT) overnight with tail -500, but nothing appears on my machines.

All outputs look OK: the first three columns are almost same values, avail is a few (< 10), delay is about the buffer size.

How can you trigger the problem so easily?

Hmm, that's a good question. It's not just me btw who can reproduce this that easily. I have put up a page (see bottom of http://pulseaudio.org/wiki/BrokenSoundDrivers) for this tool during the WE and got about 20 reports back (some of this I actually mentioned on the ML already), for different cards on different distros (from all of Fedora, OpenSUSE, Mandriva, Ubuntu and Debian).

Most of this issues can be reproduced in 5 mins or so. For the intel8x0 issues it was a little bit harder, took 30min to be triggered.

Most of my own tests was one SMP btw, not sure if that matters. The excpetion is the intl8x0 stuff which I tested in a machine with a single CPU.

Since a few PA versions back I write into syslog when snd_pcm_avail() returns apparently overflown values. Our bugzilla is now full of reports of messages like this. Here are a two of them as examples:

https://bugzilla.redhat.com/show_bug.cgi?id=485734 https://bugzilla.redhat.com/show_bug.cgi?id=483559

(Look for the snd_pcm_avail_update() log messages in those reports)

BTW, any update on the issue that calling poll() on an ALSA device returns POLLOUT however a subsequent snd_pcm_avail() returns 0 or some other value < min_avail? I reported that a while back on the ML. And a *lot* of drivers are suffering by this. This causes the PA IO loop to spin quite often for no reason.

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Lennart Poettering

24 Feb 24 Feb

10:16 p.m.

On Tue, 24.02.09 20:26, Lennart Poettering (mznyfn@0pointer.de) wrote:

...

Oh, and one thing I didn't actually notice earlier: Most drivers return a negative snd_pcm_delay() if a real underrun happens. According to the definition of snd_pcm_delay() that we agreed on a couple of months ago and that is now docuemented in doxygen this makes no sense. The definition of snd_pcm_delay() goes like this:

"For playback the delay is defined as the time that a frame that is written to the PCM stream shortly after this call will take to be actually audible. It is as such the overall latency from the write call to the final DAC." (from the doxygen docs)

I.e. because on the this world it is impossible to hear a sample that hasn't even been written yet, _delay() should only return positive values. However many drivers do return negative values on underrun.

I take this back.

I think the correct way to handle an underrun when stop_threshold is set to boundary is that if a write happens after an underrun the appropriate amount of data is simply dropped. I think enabling this mode is primarily useful to guarantee timer stability even in case of underrun. That means snd_pcm_delay() should very well return negative values meaning "what you write next is past the playback pointer, it will not be heard".

Lennart

-- Lennart Poettering Red Hat, Inc. lennart [at] poettering [dot] net ICQ# 11060553 http://0pointer.net/lennart/ GnuPG 0x1A015CC4

Takashi Iwai

25 Feb 25 Feb

11:48 a.m.

At Tue, 24 Feb 2009 22:16:21 +0100, Lennart Poettering wrote:

...

On Tue, 24.02.09 20:26, Lennart Poettering (mznyfn@0pointer.de) wrote:

...
Oh, and one thing I didn't actually notice earlier: Most drivers return a negative snd_pcm_delay() if a real underrun happens. According to the definition of snd_pcm_delay() that we agreed on a couple of months ago and that is now docuemented in doxygen this makes no sense. The definition of snd_pcm_delay() goes like this:

"For playback the delay is defined as the time that a frame that is written to the PCM stream shortly after this call will take to be actually audible. It is as such the overall latency from the write call to the final DAC." (from the doxygen docs)

I.e. because on the this world it is impossible to hear a sample that hasn't even been written yet, _delay() should only return positive values. However many drivers do return negative values on underrun.

I take this back.

I think the correct way to handle an underrun when stop_threshold is set to boundary is that if a write happens after an underrun the appropriate amount of data is simply dropped. I think enabling this mode is primarily useful to guarantee timer stability even in case of underrun. That means snd_pcm_delay() should very well return negative values meaning "what you write next is past the playback pointer, it will not be heard".

Well, I'm afraid that it's undesired behavior.

In a free-wheel mode, there is no real underrun (except that the driver pointer callback explicitly returns the error). It basically means "I do care any XRUN errors by myself, so don't bother me no matter what crazy value is set." PCM core shouldn't react so much differently depending on the avail/delay value in this mode.

Anyway, the definitions of delay and avail in a free-wheel mode are ambiguous and rather confusing indeed...

Takashi

Jaroslav Kysela

11:57 a.m.

On Wed, 25 Feb 2009, Takashi Iwai wrote:

...

Anyway, the definitions of delay and avail in a free-wheel mode are ambiguous and rather confusing indeed...

I don't agree. We manage appl_ptr also in no-XRUN check mode for ALSA apps, so the negative values makes sense and applications can quickly skip missed samples (snd_pcm_forward).

Jaroslav

----- Jaroslav Kysela perex@perex.cz Linux Kernel Sound Maintainer ALSA Project, Red Hat, Inc.

Takashi Iwai

12:04 p.m.

At Wed, 25 Feb 2009 11:57:32 +0100 (CET), Jaroslav Kysela wrote:

...

On Wed, 25 Feb 2009, Takashi Iwai wrote:

...
Anyway, the definitions of delay and avail in a free-wheel mode are ambiguous and rather confusing indeed...

I don't agree. We manage appl_ptr also in no-XRUN check mode for ALSA apps, so the negative values makes sense and applications can quickly skip missed samples (snd_pcm_forward).

Well, if we follow the definition of snd_pcm_delay(), applptr - delay doesn't always point to the position to write for the next. (Right now it works so, though).

Takashi

Jaroslav Kysela

12:10 p.m.

On Wed, 25 Feb 2009, Takashi Iwai wrote:

...

At Wed, 25 Feb 2009 11:57:32 +0100 (CET), Jaroslav Kysela wrote:

...
On Wed, 25 Feb 2009, Takashi Iwai wrote:

...
Anyway, the definitions of delay and avail in a free-wheel mode are ambiguous and rather confusing indeed...

I don't agree. We manage appl_ptr also in no-XRUN check mode for ALSA apps, so the negative values makes sense and applications can quickly skip missed samples (snd_pcm_forward).

Well, if we follow the definition of snd_pcm_delay(), applptr - delay doesn't always point to the position to write for the next. (Right now it works so, though).

Sure, but negative value indicates how many samples is application behind the output from hardware. The delay value cannot be used for buffer filling, of course (also in standard - no-XRUN case - avail functions should be used).

Jaroslav

----- Jaroslav Kysela perex@perex.cz Linux Kernel Sound Maintainer ALSA Project, Red Hat, Inc.

Takashi Iwai

12:17 p.m.

At Wed, 25 Feb 2009 12:10:24 +0100 (CET), Jaroslav Kysela wrote:

...

On Wed, 25 Feb 2009, Takashi Iwai wrote:

...
At Wed, 25 Feb 2009 11:57:32 +0100 (CET), Jaroslav Kysela wrote:

...
On Wed, 25 Feb 2009, Takashi Iwai wrote:

...
Anyway, the definitions of delay and avail in a free-wheel mode are ambiguous and rather confusing indeed...

I don't agree. We manage appl_ptr also in no-XRUN check mode for ALSA apps, so the negative values makes sense and applications can quickly skip missed samples (snd_pcm_forward).

Well, if we follow the definition of snd_pcm_delay(), applptr - delay doesn't always point to the position to write for the next. (Right now it works so, though).

Sure, but negative value indicates how many samples is application behind the output from hardware. The delay value cannot be used for buffer filling, of course (also in standard - no-XRUN case - avail functions should be used).

The problem I foresee is that a hardware with pre-fetch style doesn't work reliably with forwarding based on the delay value. The free-wheel support on such a hardware is anyway tricky, so it's not only about avail and delay, though.

Takashi

5982

Age (days ago)

5988

Last active (days ago)

List overview

Download

36 comments

5 participants

tags (0)

participants (5)

Clive Messer
Colin Guthrie
Jaroslav Kysela
Lennart Poettering
Takashi Iwai