[alsa-devel] Misusing snd_pcm_avail_update()
Heya!
Currently, in the 'glitch-free' logic of PulseAudio I use snd_pcm_avail_update() to estimate how I need to program my system timers for the next wake-up for the next buffer fill-up. For that I assume that the current fill level of the hardware buffer is the hardware buffer size minus what s_p_a_u() returns. I then convert that fill level from sample units to time units, and fix it up by the deviation of the sound card time from the system time. Finally I subtract some extra margin just to make sure.
This I assumed would tell me how much time will pass until an underrun happens if I don't write anything.
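In code, the estimate boils down to something like this (a sketch of my own for illustration, not the actual PulseAudio source; rate_correction and margin_usec stand in for the clock-deviation fix-up and the safety margin):

    #include <alsa/asoundlib.h>
    #include <stdint.h>

    /* Sketch of the estimation described above -- illustration only,
     * not the actual PulseAudio code. rate_correction models the sound
     * card vs. system clock deviation, margin_usec the extra margin. */
    static int64_t estimate_sleep_usec(snd_pcm_t *pcm,
                                       snd_pcm_uframes_t buffer_size,
                                       unsigned int rate,
                                       double rate_correction,
                                       int64_t margin_usec)
    {
        snd_pcm_sframes_t avail = snd_pcm_avail_update(pcm);
        if (avail < 0)
            return avail; /* propagate the error code */

        /* assumed fill level: buffer size minus what s_p_a_u() returns */
        snd_pcm_uframes_t fill = buffer_size - (snd_pcm_uframes_t) avail;

        /* convert frames to microseconds, corrected for clock deviation */
        int64_t usec = (int64_t) (fill * 1000000.0 / rate * rate_correction);

        /* subtract the extra margin, just to make sure */
        return usec > margin_usec ? usec - margin_usec : 0;
    }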
Mostly this logic works fine. But on some setups and in some cases it doesn't. ALSA will signal an underrun much, much earlier than what I estimated like this.
I am now wondering why. One possibility, of course, is that s_p_a_u() is not reliable due to driver issues (there were problems in the HDA driver about this, right?). Also, s_p_a_u() might simply lag behind quite a bit, or -- what I think is most likely -- because samples are popped in larger blocks from the hw playback buffer, we reach the underrun much earlier than expected.
I do acknowledge that the way I use s_p_a_u() is probably a misuse of the API. I make assumptions I probably shouldn't make.
Now, considering all this I'd like to ask for a new API function that tells me how much time I really have before the next underrun. It probably should just return a value in sample units, leaving it to the application to deal with system/sound card clock deviations.
Any opinions on this?
Lennart
Lennart Poettering wrote:
Currently, in the 'glitch-free' logic of PulseAudio I use snd_pcm_avail_update() to estimate how I need to program my system timers for the next wake-up for the next buffer fill-up. For that I assume that the current fill level of the hardware buffer is the hardware buffer size minus what s_p_a_u() returns. I then convert that fill level from sample units to time units, and fix it up by the deviation of the sound card time from the system time. Finally I subtract some extra margin just to make sure.
This I assumed would tell me how much time will pass until an underrun happens if I don't write anything.
Mostly this logic works fine. But on some setups and in some cases it doesn't. ALSA will signal an underrun much, much earlier than what I estimated like this.
I am now wondering why. One possibility, of course, is that s_p_a_u() is not reliable due to driver issues (there were problems in the HDA driver about this, right?).
Some hardware doesn't reliably tell the current position in the buffer.
Also, s_p_a_u() might simply lag behind quite a bit,
In the case above, when a driver detects that the hardware position is incorrect, it uses the last known value. Usually, this isn't off by more than a few samples.
There is hardware that does not allow reading the current position. With such a device, the position you get is computed at every interrupt, i.e., you get the last period boundary.
or -- what I think is most likely -- because samples are popped in larger blocks from the hw playback buffer, we reach the underrun much earlier than expected.
This happens, too. Many PCI devices read PCM data in blocks of 32 or 64 bytes. Many wavetable chips (Emu10k1, DS-1, CS46xx) read sample data in blocks of 256 or 512 samples. USB transfers blocks of at least 1 ms length, but often a multiple of that to reduce the number of USB completion interrupts.
After choosing hardware parameters, you can call snd_pcm_hw_params_is_block_transfer() to determine if the device transfers samples in comparatively large blocks. (The wavetable and USB drivers set this flag.) There is currently no function to determine the block size.
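For illustration, querying the flag looks like this (a minimal sketch, assuming the hardware parameters have already been installed on the handle):

    #include <alsa/asoundlib.h>

    /* Minimal sketch: ask whether the device transfers samples in
     * comparatively large blocks. Assumes hw_params have been set. */
    static int uses_block_transfer(snd_pcm_t *pcm)
    {
        snd_pcm_hw_params_t *params;
        int err;

        snd_pcm_hw_params_alloca(&params);
        err = snd_pcm_hw_params_current(pcm, params);
        if (err < 0)
            return err;

        return snd_pcm_hw_params_is_block_transfer(params); /* 1 = yes */
    }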
In the worst case, the current position isn't guaranteed to be more accurate than the last period boundary.
I do acknowledge that the way I use s_p_a_u() is probably a misuse of the API.
The API was primarily designed for applications that are woken up at period boundaries. Using s_p_a_u() to bypass the synchronization implied by period interrupts _is_ possible, but it cannot give you more precision than the hardware supports.
Now, considering all this I'd like to ask for a new API function that tells me how much time I really have before the next underrun.
Well, you could make the "some extra margin" above larger than one period.
Or monitor the device over some time and see what the smallest increment is you get in successive s_p_a_u() return values.
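Such a probe could look like this (my sketch; the 1 ms polling interval and the iteration count are arbitrary, and nothing must be written to the stream while probing):

    #include <alsa/asoundlib.h>
    #include <unistd.h>

    /* Sketch: watch successive snd_pcm_avail_update() values on a
     * running stream and record the smallest nonzero increment as an
     * estimate of the hardware's pointer granularity. */
    static snd_pcm_sframes_t probe_granularity(snd_pcm_t *pcm, int iterations)
    {
        snd_pcm_sframes_t last = snd_pcm_avail_update(pcm);
        snd_pcm_sframes_t min_step = -1;

        while (last >= 0 && iterations-- > 0) {
            usleep(1000); /* 1 ms; arbitrary polling interval */

            snd_pcm_sframes_t now = snd_pcm_avail_update(pcm);
            if (now < 0)
                break;
            if (now > last) {
                snd_pcm_sframes_t step = now - last;
                if (min_step < 0 || step < min_step)
                    min_step = step;
            }
            last = now;
        }
        return min_step; /* -1 if no movement was observed */
    }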
Best regards, Clemens
I wrote:
After choosing hardware parameters, you can call snd_pcm_hw_params_is_block_transfer() to determine if the device transfers samples in comparatively large blocks. (The wavetable and USB drivers set this flag.)
A quick grep over the kernel source shows that this flag is also set by many drivers that shouldn't, so it isn't very reliable.
Best regards, Clemens
On Tue, 20.01.09 09:29, Clemens Ladisch (clemens@ladisch.de) wrote:
I am now wondering why. One possibility, of course, is that s_p_a_u() is not reliable due to driver issues (there were problems in the HDA driver about this, right?).
Some hardware doesn't reliably tell the current position in the buffer.
Hence it would be good to at least know the range of the current buffer index, i.e., s_p_a_u() is something like the lower bound of where the playback index might be. With the function I am suggesting, it would then be possible to query the upper bound of where the index might be.
Also, s_p_a_u() might simply lag behind quite a bit,
In the case above, when a driver detects that the hardware position is incorrect, it uses the last known value. Usually, this isn't off by more than a few samples.
There is hardware that does not allow reading the current position. With such a device, the position you get is computed at every interrupt, i.e., you get the last period boundary.
In this case, too, it would be good to know how reliable the value is and to have it as a range (i.e., two values), not just a lower bound.
or -- what I think is most likely -- because samples are popped in larger blocks from the hw playback buffer, we reach the underrun much earlier than expected.
This happens, too. Many PCI devices read PCM data in blocks of 32 or 64 bytes. Many wavetable chips (Emu10k1, DS-1, CS46xx) read sample data in blocks of 256 or 512 samples. USB transfers blocks of at least 1 ms length, but often a multiple of that to reduce the number of USB completion interrupts.
Particularly with USB I experience that right after the device is started, data is read from the playback buffer much, much faster than expected. It feels as if the USB driver initially took all the data from the playback buffer and copied it to some other buffer that was previously completely empty; then, after that second buffer is filled up, the copying slows down to the expected speed. I currently deal with this by always halving the first wakeup time -- which works most of the time but is a hack.
With the function I suggest I'd be able to explicitly query how much time I have before I need to wake up.
After choosing hardware parameters, you can call snd_pcm_hw_params_is_block_transfer() to determine if the device transfers samples in comparatively large blocks. (The wavetable and USB drivers set this flag.) There is currently no function to determine the block size.
I think Takashi mentioned that s_p_h_i_b_t() is not really reliable and shouldn't be used -- it isn't that useful anyway if the block size isn't known.
In the worst case, the current position isn't guaranteed to be more accurate than the last period boundary.
I do acknowledge that the way I use s_p_a_u() is probably a misuse of the API.
The API was primarily designed for applications that are woken up at period boundaries. Using s_p_a_u() to bypass the synchronization implied by period interrupts _is_ possible, but it cannot give you more precision than the hardware supports.
Now, considering all this I'd like to ask for a new API function that tells me how much time I really have before the next underrun.
Well, you could make the "some extra margin" above larger than one period.
To save power I want to disable interrupts from the sound cards as much as possible, i.e., I set the minimal number of periods I can. Usually that means 1 or 2 periods. Having an extra margin that large would defeat the whole point of the "glitch-free" logic.
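For reference, picking the minimal period count looks roughly like this (a sketch; it assumes the other constraints have already been applied to params):

    #include <alsa/asoundlib.h>

    /* Sketch: request the smallest period count the hardware allows,
     * so the card raises as few interrupts as possible. */
    static int set_minimal_periods(snd_pcm_t *pcm, snd_pcm_hw_params_t *params)
    {
        unsigned int periods;
        int dir = 0;
        int err = snd_pcm_hw_params_get_periods_min(params, &periods, &dir);
        if (err < 0)
            return err;
        return snd_pcm_hw_params_set_periods_near(pcm, params, &periods, &dir);
    }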
Or monitor the device over some time and see what the smallest increment is you get in successive s_p_a_u() return values.
Humpf, that seems like a hack to me.
Lennart
Lennart Poettering wrote:
Particularly with USB I experience that right after the device is started, data is read from the playback buffer much, much faster than expected. It feels as if the USB driver initially took all the data from the playback buffer and copied it to some other buffer that was previously completely empty; then, after that second buffer is filled up, the copying slows down to the expected speed.
Yes, the USB driver uses double-buffering, and the initial queueing of data for the USB controller is done at a faster rate, to reduce the startup latency.
The size of the second buffer is about one period, but never more than 64 ms.
I currently deal with this by always halving the first wakeup time -- which works most of the time but is a hack.
In theory, you could deduce this behaviour from snd_pcm_hw_params_is_double(), but the USB driver forgets to set this flag.
With the function I suggest I'd be able to explicitly query how much time I have before I need to wake up.
I was thinking about a function that returns the hardware's block size (i.e., the precision of the avail/delay values), but that wouldn't be able to describe this behaviour of the USB driver. I think I might just remove this feature.
Well, you could make the "some extra margin" above larger than one period.
To save power I want to disable interrupts from the sound cards as much as possible.
In some cases (unusual hardware, but also USB), the period size affects the block size, i.e., smaller periods give better timing precision.
For this case, it might be useful to make the "pointer precision" a hardware parameter that can be restricted by an interval, like the other parameters.
Best regards, Clemens
On Tue, 20.01.09 19:48, Clemens Ladisch (clemens@ladisch.de) wrote:
I currently deal with this by always halving the first wakeup time -- which works most of the time but is a hack.
In theory, you could deduce this behaviour from snd_pcm_hw_params_is_double(), but the USB driver forgets to set this flag.
But still, with this flag I would only know that the startup sequence is "fast", not how fast.
It appears to me that it would make a lot more sense if the driver simply told me how long I may sleep, instead of adding multiple new functions that 1) tell me whether double-buffering is used and what the size of the second buffer is, 2) tell me that data is pulled block-by-block from the buffer and what the block size is, and so on.
The function should look like this:
snd_pcm_sframes_t snd_pcm_busy_for(snd_pcm_t *pcm);
I called the prototype "busy for" since effectively the value I am looking for is the time the card will be busy with the data it already has, and doesn't need any new data.
Can I convince you guys that a function like this would make a lot of sense?
Instead of exporting all the gory details about blocks/double buffering and so on, just a simple high-level call.
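To make the intended use concrete, the timer-based scheduler would then reduce to something like this (hypothetical, of course, since snd_pcm_busy_for() doesn't exist yet):

    #include <alsa/asoundlib.h>
    #include <stdint.h>

    /* Hypothetical usage sketch; snd_pcm_busy_for() is the proposed
     * function and is not part of the ALSA API. */
    static int64_t next_wakeup_usec(snd_pcm_t *pcm, unsigned int rate,
                                    int64_t margin_usec)
    {
        snd_pcm_sframes_t busy = snd_pcm_busy_for(pcm);
        if (busy < 0)
            return busy;

        /* sleep until shortly before the device runs out of data */
        int64_t usec = (int64_t) busy * 1000000 / rate;
        return usec > margin_usec ? usec - margin_usec : 0;
    }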
With the function I suggest I'd be able to explicitly query how much time I have before I need to wake up.
I was thinking about a function that returns the hardware's block size (i.e., the precision of the avail/delay values), but that wouldn't be able to describe this behaviour of the USB driver. I think I might just remove this feature.
I am pretty sure other drivers might work like this as well. Hence I think simply removing the double buffering in the USB driver doesn't really solve the general issues I have.
Well, you could make the "some extra margin" above larger than one period.
To save power I want to disable interrupts from the sound cards as much as possible.
In some cases (unusual hardware, but also USB), the period size affects the block size, i.e., smaller periods give better timing precision.
It would be good if these could be controlled independently of each other.
For this case, it might be useful to make the "pointer precision" a hardware parameter that can be restricted by an interval, like the other parameters.
That would be good.
Lennart
At Tue, 20 Jan 2009 21:29:34 +0100, Lennart Poettering wrote:
On Tue, 20.01.09 19:48, Clemens Ladisch (clemens@ladisch.de) wrote:
I currently deal with this by always halving the first wakeup time -- which works most of the time but is a hack.
In theory, you could deduce this behaviour from snd_pcm_hw_params_is_double(), but the USB driver forgets to set this flag.
But still, with this flag I would only know that the startup sequence is "fast", not how fast.
It appears to me that it would make a lot more sense if the driver simply told me how long I may sleep, instead of adding multiple new functions that 1) tell me whether double-buffering is used and what the size of the second buffer is, 2) tell me that data is pulled block-by-block from the buffer and what the block size is, and so on.
The function should look like this:
snd_pcm_sframes_t snd_pcm_busy_for(snd_pcm_t *pcm);
I called the prototype "busy for" since effectively the value I am looking for is the time the card will be busy with the data it already has, and doesn't need any new data.
Isn't that what snd_pcm_delay() was originally designed for? Did you check my previous patch?
Takashi
On Wed, 21.01.09 01:39, Takashi Iwai (tiwai@suse.de) wrote:
The function should look like this:
snd_pcm_sframes_t snd_pcm_busy_for(snd_pcm_t *pcm);
I called the prototype "busy for" since effectively the value I am looking for is the time the card will be busy with the data it already has, and doesn't need any new data.
Isn't that what snd_pcm_delay() was originally designed for?
No. Let me summarize the meaning of snd_pcm_avail_update(), snd_pcm_delay() and my snd_pcm_busy_for() to hopefully make clear where the differences are:
snd_pcm_avail_update() -- returns how many samples can be written right now without blocking.
snd_pcm_delay() -- returns how many samples will be played before the samples that are written now can be heard.
snd_pcm_busy_for() -- returns how many samples will be played before ALSA would enter an underrun situation if no further samples are written.
snd_pcm_avail_update() and snd_pcm_busy_for() return metrics that are solely dependent on the size and metrics of the hardware buffer and its current indexes. snd_pcm_delay() also includes information about any extra latency that comes after the playback buffer.
Only snd_pcm_avail_update()/snd_pcm_busy_for() are influenced by "fast starts" as done by the USB driver's double buffering and by block-based transfer.
Hmm, I am trying my best to explain why I want this function and what exactly it should do. Any chance I can convince you guys that this function really matters for timer-based audio scheduling?
Did you check my previous patch?
You mean the one that makes snd_pcm_delay() for USB devices actually include the extra latency that comes after the playback buffer? No, I didn't check that one yet. It takes so much time to patch the kernel and test things... I'll try to finally do it this weekend.
Lennart
At Thu, 22 Jan 2009 23:20:15 +0100, Lennart Poettering wrote:
On Wed, 21.01.09 01:39, Takashi Iwai (tiwai@suse.de) wrote:
The function should look like this:
snd_pcm_sframes_t snd_pcm_busy_for(snd_pcm_t *pcm);
I called the prototype "busy for" since effectively the value I am looking for is the time the card will be busy with the data it already has, and doesn't need any new data.
Isn't that what snd_pcm_delay() was originally designed for?
No. Let me summarize the meaning of snd_pcm_avail_update(), snd_pcm_delay() and my snd_pcm_busy_for() to hopefully make clear where the differences are:
snd_pcm_avail_update() -- returns how many samples can be written right now without blocking.
snd_pcm_delay() -- returns how many samples will be played before the samples that are written now can be heard.
snd_pcm_busy_for() -- returns how many samples will be played before ALSA would enter an underrun situation if no further samples are written.
Well, in a ring-buffer model,
snd_pcm_busy_for = buffer_size - snd_pcm_avail_update
If a granularity matters (e.g. no accurate sample position update can be done), it would be
snd_pcm_busy_for = max{0, buffer_size - s_p_a_u - granularity}
The granularity is between 0 and period_size. The batch-mode is granularity = period_size.
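In code, that would be something like the following sketch (the granularity would have to be supplied by the driver; this helper is not a real API):

    #include <alsa/asoundlib.h>

    /* Sketch of the ring-buffer computation above; 'granularity' would
     * have to come from the driver. Not a real ALSA function. */
    static snd_pcm_sframes_t busy_for(snd_pcm_t *pcm,
                                      snd_pcm_uframes_t buffer_size,
                                      snd_pcm_uframes_t granularity)
    {
        snd_pcm_sframes_t avail = snd_pcm_avail_update(pcm);
        if (avail < 0)
            return avail;

        snd_pcm_sframes_t busy = (snd_pcm_sframes_t) buffer_size
                                 - avail - (snd_pcm_sframes_t) granularity;
        return busy > 0 ? busy : 0; /* max{0, ...} */
    }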
snd_pcm_avail_update() and snd_pcm_busy_for() return metrics that are solely dependent on the size and metrics of the hardware buffer and its current indexes. snd_pcm_delay() also includes information about any extra latency that comes after the playback buffer.
Only snd_pcm_avail_update()/snd_pcm_busy_for() are influenced by "fast starts" as done by the USB driver's double buffering and by block-based transfer.
Hmm, I am trying my best to explain why I want this function and what exactly it should do. Any chance I can convince you guys that this function really matters for timer-based audio scheduling?
I don't care much about the user-space API at this moment. My main concern is what kernel <-> user API is needed in addition or needed to be changed.
If it's a question of how to pass the granularity to user-space, usually it's a constant value, and thus it can be put somewhere in the existing struct, or a single ioctl can be added.
OTOH, if it has to be implemented in the form of snd_pcm_busy_for(), the kernel needs a computation like the above. That's my concern.
Did you check my previous patch?
You mean the one that makes snd_pcm_delay() for USB devices actually include the extra latency that comes after the playback buffer? No, I didn't check that one yet. It takes so much time to patch the kernel and test things... I'll try too finally do it this WE.
Hehe, changing API (should) take time, too :)
Takashi
Takashi Iwai wrote:
[...] My main concern is what kernel <-> user API is needed in addition or needed to be changed.
If it's a question of how to pass the granularity to user-space, usually it's a constant value, and thus it can be put somewhere in the existing struct, or a single ioctl can be added.
Most PCI devices have 32 bytes; wavetable chips have a constant time (5.33 ms, i.e., 256 frames resampled to 48 kHz). But the interesting cases are where the granularity is dependent on the period size, or where the application could choose some arbitrary value (USB). For these cases, it would be very useful to have the granularity as an interval in the PCM hardware parameters (or probably three: bytes/frames/time).
In the case of granularity==period, this allows PulseAudio to detect that it has to work with small periods after it has set a small upper bound for the granularity. (This is exactly what the hw_param dependencies were designed for.)
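From the application side, that could look roughly like this (purely hypothetical: snd_pcm_hw_params_set_granularity_max() is an invented name, no such hw_param exists yet):

    /* Hypothetical: constrain the pointer granularity like any other
     * hw_param interval. The setter name is invented for illustration. */
    snd_pcm_uframes_t max_gran = 64; /* frames */
    int err = snd_pcm_hw_params_set_granularity_max(pcm, params, &max_gran);
    /* With granularity == period on some hardware, the usual hw_params
     * dependency resolution would then narrow the period size, too. */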
OTOH, if it has to be implemented in the form of snd_pcm_busy_for(), the kernel needs a computation like the above. That's my concern.
Instead of writing a callback in the USB driver to compute the time until the next underrun, I'd rather rip out that fast start code. So, no kernel computation is needed. :-)
Anyway, regardless of how the API looks, I see two compatibility concerns:
- For many devices (legacy ISA, etc.), we just don't know the correct value.
- What should alsa-lib do when it runs on an old kernel? It could return a worst-case estimate (period size), but this would cause PA to use small periods. Perhaps it would be better to return some error ("don't know").
Best regards, Clemens
At Fri, 23 Jan 2009 18:56:38 +0100, Clemens Ladisch wrote:
Takashi Iwai wrote:
[...] My main concern is what kernel <-> user API is needed in addition or needed to be changed.
If it's a question of how to pass the granularity to user-space, usually it's a constant value, and thus it can be put somewhere in the existing struct, or a single ioctl can be added.
Most PCI devices have 32 bytes; wavetable chips have a constant time (5.33 ms, i.e., 256 frames resampled to 48 kHz). But the interesting cases are where the granularity is dependent on the period size, or where the application could choose some arbitrary value (USB). For these cases, it would be very useful to have the granularity as an interval in the PCM hardware parameters (or probably three: bytes/frames/time).
Right. I noticed it when I wrote a patch for snd_pcm_delay() extension for usb-audio device.
This contradicts my previous comment, but variable granularity is one thing we can consider. It may depend on how accurate it must be, though. If snd_pcm_busy_for() should return the maximal safe sleep time, then a constant value would work well.
In the case of granularity==period, this allows PulseAudio to detect that it has to work with small periods after it has set a small upper bound for the granularity. (This is exactly what the hw_param dependencies were designed for.)
Hm, this reminds me that granularity isn't only what the hardware provides but also what the app can request, directly or indirectly. It's a good point. On some hardware, you can't avoid a small period size if you want a smaller granularity.
OTOH, if it has to be implemented in the form of snd_pcm_busy_for(), the kernel needs a computation like the above. That's my concern.
Instead of writing a callback in the USB driver to compute the time until the next underrun, I'd rather rip out that fast start code. So, no kernel computation is needed. :-)
Anyway, regardless of how the API looks, I see two compatibility concerns:
- For many devices (legacy ISA, etc.), we just don't know the correct value.
Right. But for these we can assume granularity=1 unless someone detects the breakage.
- What should alsa-lib do when it runs on an old kernel? It could return a worst-case estimate (period size), but this would cause PA to use small periods. Perhaps it would be better to return some error ("don't know").
I think returning undefined is a better choice.
thanks,
Takashi
On Fri, 23.01.09 18:56, Clemens Ladisch (clemens@ladisch.de) wrote:
Takashi Iwai wrote:
[...] My main concern is what kernel <-> user API is needed in addition or needed to be changed.
If it's a question of how to pass the granularity to user-space, usually it's a constant value, and thus it can be put somewhere in the existing struct, or a single ioctl can be added.
Most PCI devices have 32 bytes; wavetable chips have a constant time (5.33 ms, i.e., 256 frames resampled to 48 kHz). But the interesting cases are where the granularity is dependent on the period size, or where the application could choose some arbitrary value (USB). For these cases, it would be very useful to have the granularity as an interval in the PCM hardware parameters (or probably three: bytes/frames/time).
In the case of granularity==period, this allows PulseAudio to detect that it has to work with small periods after it has set a small upper bound for the granularity. (This is exactly what the hw_param dependencies were designed for.)
OTOH, if it has to be implemented in the form of snd_pcm_busy_for(), the kernel needs a computation like the above. That's my concern.
Instead of writing a callback in the USB driver to compute the time until the next underrun, I'd rather rip out that fast start code. So, no kernel computation is needed. :-)
While I think it would be good not to have this kind of double-buffering, I wonder if this is really future-proof, i.e., can this be done with every driver that uses 'fast starts'?
Anyway, regardless of how the API looks, I see two compatibility concerns:
- For many devices (legacy ISA, etc.), we just don't know the correct value.
But it should be possible to pick a safe boundary, shouldn't it?
- What should alsa-lib do when it runs on an old kernel? It could return a worst-case estimate (period size), but this would cause PA to use small periods. Perhaps it would be better to return some error ("don't know").
If we did it via the busy_for() API we could simply return buffer_size - avail_update() - extra_margin. Or simply return ENOSUPP. That would be fine too.
Lennart
Lennart Poettering wrote:
On Fri, 23.01.09 18:56, Clemens Ladisch (clemens@ladisch.de) wrote:
Instead of writing a callback in the USB driver to compute the time until the next underrun, I'd rather rip out that fast start code.
(Done.)
So, no kernel computation is needed. :-)
While I think it would be good not to have this kind of double-buffering, I wonder if this is really future-proof, i.e., can this be done with every driver that uses 'fast starts'?
Yes, because the USB driver was the only one that did this.
There are other drivers that use double-buffering (and the USB driver still does), but there the playback speed does not change, i.e., the stream is not more underrun-prone when starting.
- For many devices (legacy ISA, etc.), we just don't know the correct value.
But it should be possible to pick a safe boundary, shouldn't it?
In theory, the _safe_ boundary is one period. In practice, ISA devices cannot afford to prefetch much data due to the low bus bandwidth, so one frame should be OK. (And we _know_ which devices do whole-period double-buffering, because the code is right there in the driver.)
Best regards, Clemens
On Fri, 23.01.09 18:13, Takashi Iwai (tiwai@suse.de) wrote:
snd_pcm_avail_update() -- returns how many samples can be written right now without blocking.
snd_pcm_delay() -- returns how many samples will be played before the samples that are written now can be heard.
snd_pcm_busy_for() -- returns how many samples will be played before ALSA would enter an underrun situation if no further samples are written.
Well, in a ring-buffer model,
snd_pcm_busy_for = buffer_size - snd_pcm_avail_update
That is what I currently use in PA, and it turns out not to work so well, due to granularity and due to "fast starts" such as those done by the USB driver.
If a granularity matters (e.g. no accurate sample position update can be done), it would be
snd_pcm_busy_for = max{0, buffer_size - s_p_a_u - granularity}
The granularity is between 0 and period_size. The batch-mode is granularity = period_size.
It would be good to have something to query the current granularity (as Clemens suggested).
snd_pcm_avail_update() and snd_pcm_busy_for() return metrics that are solely dependent on the size and metrics of the hardware buffer and its current indexes. snd_pcm_delay() also includes information about any extra latency that comes after the playback buffer.
Only snd_pcm_avail_update()/snd_pcm_busy_for() are influenced by "fast starts" as done by the USB driver's double buffering and by block-based transfer.
Hmm, I am trying my best to explain why I want this function and what exactly it should do. Any chance I can convince you guys that this function really matters for timer-based audio scheduling?
I don't care much about the user-space API at this moment. My main concern is what kernel <-> user API is needed in addition or needed to be changed.
If it's a question of how to pass the granularity to user-space, usually it's a constant value, and thus it can be put somewhere in the existing struct, or a single ioctl can be added.
OTOH, if it has to be implemented in the form of snd_pcm_busy_for(), the kernel needs a computation like the above. That's my concern.
Hmm, maybe there could be a default implementation that works as you suggested above? And only drivers that have a different buffer model (such as 'fast starts') or a granularity that is >1 sample would need to override that default implementation.
The reason I'd prefer having snd_pcm_busy_for() over separate APIs to query the granularity and 'fast starts' is that the latter would force the underlying drivers into a specific buffer model. However, especially with userspace drivers, the buffering used might be very complex and very different from how current hardware drivers do it. Hence I'd prefer a high-level API that leaves room for different buffering models, including those that might come in the future, over breaking buffering down into primitive parameters that userspace has to make sense of in one very specific scheme.
Lennart
2009/1/21 Takashi Iwai tiwai@suse.de:
At Tue, 20 Jan 2009 21:29:34 +0100, Lennart Poettering wrote:
I called the prototype "busy for" since effectively the value I am looking for is the time the card will be busy with the data it already has, and doesn't need any new data.
Isn't that what snd_pcm_delay() was originally designed for? Did you check my previous patch?
No. snd_pcm_delay() was designed to aid audio/video sync, i.e., if I write a sample to the buffer now, it will play on the speakers in X samples' time. In this way, if I have a sample that must play in X samples' time, I write it now. If I have a sample that must play in X+3 samples' time, I add 3 samples of padding first.
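A minimal sketch of that sync calculation:

    #include <alsa/asoundlib.h>
    #include <stdint.h>

    /* A sample written now is heard 'delay' frames from now, so video
     * can be scheduled against this latency. Sketch; no error recovery. */
    static int64_t playback_latency_usec(snd_pcm_t *pcm, unsigned int rate)
    {
        snd_pcm_sframes_t delay;
        int err = snd_pcm_delay(pcm, &delay);
        if (err < 0)
            return err;
        return (int64_t) delay * 1000000 / rate;
    }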