[PATCH] ASoC: Intel: Atom: use hardware counter to update hw_ptr
The ring buffer counter runs faster than hardware counter if the period size in hw_param is larger than 240. Although the differce is not much (around 2k frames), it causes false underrun in CRAS sometimes because it's using 256 frames as period size in hw_param. Using the hardware counter could provide precise hw_ptr to user space and avoid the false underrun in CRAS.
Signed-off-by: Brent Lu brent.lu@intel.com --- sound/soc/intel/atom/sst/sst_drv_interface.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/sound/soc/intel/atom/sst/sst_drv_interface.c b/sound/soc/intel/atom/sst/sst_drv_interface.c index 7624953..1949ad9 100644 --- a/sound/soc/intel/atom/sst/sst_drv_interface.c +++ b/sound/soc/intel/atom/sst/sst_drv_interface.c @@ -485,7 +485,6 @@ static inline int sst_calc_tstamp(struct intel_sst_drv *ctx, struct snd_pcm_substream *substream, struct snd_sst_tstamp *fw_tstamp) { - size_t delay_bytes, delay_frames; size_t buffer_sz; u32 pointer_bytes, pointer_samples;
@@ -493,22 +492,14 @@ static inline int sst_calc_tstamp(struct intel_sst_drv *ctx, fw_tstamp->ring_buffer_counter); dev_dbg(ctx->dev, "mrfld hardware_counter %llu in bytes\n", fw_tstamp->hardware_counter); - if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK) - delay_bytes = (size_t) (fw_tstamp->ring_buffer_counter - - fw_tstamp->hardware_counter); - else - delay_bytes = (size_t) (fw_tstamp->hardware_counter - - fw_tstamp->ring_buffer_counter); - delay_frames = bytes_to_frames(substream->runtime, delay_bytes); + buffer_sz = snd_pcm_lib_buffer_bytes(substream); - div_u64_rem(fw_tstamp->ring_buffer_counter, buffer_sz, &pointer_bytes); + div_u64_rem(fw_tstamp->hardware_counter, buffer_sz, &pointer_bytes); pointer_samples = bytes_to_samples(substream->runtime, pointer_bytes);
- dev_dbg(ctx->dev, "pcm delay %zu in bytes\n", delay_bytes); - info->buffer_ptr = pointer_samples / substream->runtime->channels; + info->pcm_delay = 0;
- info->pcm_delay = delay_frames; dev_dbg(ctx->dev, "buffer ptr %llu pcm_delay rep: %llu\n", info->buffer_ptr, info->pcm_delay); return 0;
On 7/26/20 11:08 AM, Brent Lu wrote:
The ring buffer counter runs faster than hardware counter if the period size in hw_param is larger than 240. Although the differce is not much (around 2k frames), it causes false underrun in CRAS sometimes because it's using 256 frames as period size in hw_param.
All the Atom firmware assumes data chunks in multiples of 1ms (typically 5, 10 or 20ms). I have never seen anyone use 256 frames, that's asking for trouble really.
it's actually the same with Skylake and SOF in most cases.
Is this a 'real' problem or a problem detected by the Chrome ALSA compliance tests, in the latter case that would hint at a too generic value of min_period.
Using the hardware counter could provide precise hw_ptr to user space and avoid the false underrun in CRAS.
Signed-off-by: Brent Lu brent.lu@intel.com
sound/soc/intel/atom/sst/sst_drv_interface.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/sound/soc/intel/atom/sst/sst_drv_interface.c b/sound/soc/intel/atom/sst/sst_drv_interface.c index 7624953..1949ad9 100644 --- a/sound/soc/intel/atom/sst/sst_drv_interface.c +++ b/sound/soc/intel/atom/sst/sst_drv_interface.c @@ -485,7 +485,6 @@ static inline int sst_calc_tstamp(struct intel_sst_drv *ctx, struct snd_pcm_substream *substream, struct snd_sst_tstamp *fw_tstamp) {
- size_t delay_bytes, delay_frames; size_t buffer_sz; u32 pointer_bytes, pointer_samples;
@@ -493,22 +492,14 @@ static inline int sst_calc_tstamp(struct intel_sst_drv *ctx, fw_tstamp->ring_buffer_counter); dev_dbg(ctx->dev, "mrfld hardware_counter %llu in bytes\n", fw_tstamp->hardware_counter);
- if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
delay_bytes = (size_t) (fw_tstamp->ring_buffer_counter -
fw_tstamp->hardware_counter);
- else
delay_bytes = (size_t) (fw_tstamp->hardware_counter -
fw_tstamp->ring_buffer_counter);
- delay_frames = bytes_to_frames(substream->runtime, delay_bytes);
- buffer_sz = snd_pcm_lib_buffer_bytes(substream);
- div_u64_rem(fw_tstamp->ring_buffer_counter, buffer_sz, &pointer_bytes);
- div_u64_rem(fw_tstamp->hardware_counter, buffer_sz, &pointer_bytes); pointer_samples = bytes_to_samples(substream->runtime, pointer_bytes);
- dev_dbg(ctx->dev, "pcm delay %zu in bytes\n", delay_bytes);
- info->buffer_ptr = pointer_samples / substream->runtime->channels;
- info->pcm_delay = 0;
and that seems also wrong? Why would the delay be zero?
- info->pcm_delay = delay_frames; dev_dbg(ctx->dev, "buffer ptr %llu pcm_delay rep: %llu\n", info->buffer_ptr, info->pcm_delay); return 0;
All the Atom firmware assumes data chunks in multiples of 1ms (typically 5, 10 or 20ms). I have never seen anyone use 256 frames, that's asking for trouble really.
it's actually the same with Skylake and SOF in most cases.
Is this a 'real' problem or a problem detected by the Chrome ALSA compliance tests, in the latter case that would hint at a too generic value of min_period.
I've told them 240 is more reasonable since the sample rate is 48000 and our android bsp also uses 240 for multimedia use case for many years but they don't want to change the CRAS setting for some reason.
Google says it's a real issue for them: "The driver consumes frames quickly at the beginning will make CRAS underrun because CRAS fills samples in the fixed rate."
Currently they implement constraint in machine driver of atom machines to force 240 period size so CRAS is using 240 for atom platforms and 256 for other big cores.
I'm curious why not just using hardware counter to update hw_ptr and get rid of the period setting in hw_param? It seems to me the ring buffer counter does not reflect the real status.
Regards, Brent
and that seems also wrong? Why would the delay be zero?
info->pcm_delay is the difference between ring buffer counter and hardware counter. Because the ring buffer counter (hw_ptr) is running faster then it should, so we add the info->pcm_delay to substream->runtime->delay as compensation.
Therefore, application could use snd_pcm_delay() to get the actual frame number which are still in buffer.
snd_pcm_delay() = buffer_size - snd_pcm_avail() + runtime->delay
We don't need pcm_delay to compensate anything if using hardware counter.
- info->pcm_delay = delay_frames; dev_dbg(ctx->dev, "buffer ptr %llu pcm_delay rep: %llu\n", info->buffer_ptr, info->pcm_delay); return 0;
On 7/27/20 9:28 PM, Lu, Brent wrote:
All the Atom firmware assumes data chunks in multiples of 1ms (typically 5, 10 or 20ms). I have never seen anyone use 256 frames, that's asking for trouble really.
it's actually the same with Skylake and SOF in most cases.
Is this a 'real' problem or a problem detected by the Chrome ALSA compliance tests, in the latter case that would hint at a too generic value of min_period.
I've told them 240 is more reasonable since the sample rate is 48000 and our android bsp also uses 240 for multimedia use case for many years but they don't want to change the CRAS setting for some reason.
Google says it's a real issue for them: "The driver consumes frames quickly at the beginning will make CRAS underrun because CRAS fills samples in the fixed rate."
Currently they implement constraint in machine driver of atom machines to force 240 period size so CRAS is using 240 for atom platforms and 256 for other big cores.
So if there are already quirks in atom machine drivers to change the period size, why is this patch necessary?
I'm curious why not just using hardware counter to update hw_ptr and get rid of the period setting in hw_param? It seems to me the ring buffer counter does not reflect the real status.
I don't recall precisely what this hardware counter does. I vaguely recall it's tied to the 19.2MHz external timer which is also used to schedule the 1ms SBA mixer and the SSP IOs. And by comparing with the ring buffer pointer you can infer the delay inside the DSP. I think you are also making an assumption that all streams are tied to the output rate, but that's most likely a bad assumption. The hard-coded topology supported media, speech and compressed data and the consumption rate on the DMA side could be faster with some buffering happening in the DSP. It's not a passthrough DMA in all cases.
This is really legacy code that no one really fully understands nor plans on improving, it'd be a bad idea to change the pcm pointer reports now, 6 years after the initial code release and after all initial contributors moved on. It's what it is.
and that seems also wrong? Why would the delay be zero?
info->pcm_delay is the difference between ring buffer counter and hardware counter. Because the ring buffer counter (hw_ptr) is running faster then it should, so we add the info->pcm_delay to substream->runtime->delay as compensation.
Therefore, application could use snd_pcm_delay() to get the actual frame number which are still in buffer.
snd_pcm_delay() = buffer_size - snd_pcm_avail() + runtime->delay
We don't need pcm_delay to compensate anything if using hardware counter.
If you force info->pcm_delay to be zero, then runtime->delay is also zero:
see sst_soc_pointer(): substream->runtime->delay = str_info->pcm_delay;
- info->pcm_delay = delay_frames;
So if there are already quirks in atom machine drivers to change the period size, why is this patch necessary?
The story is: google implemented the constraint but doesn't know why it works so asked us to explain. After checking the two counters I realized the increase of ring buffer pointer follows the period size setting in hw_param (256) but the period of interrupt is always 5ms instead of 5.33 so it's running little bit too fast. It seems the LPE keeps tracking the difference of two counters. When the difference exceeds 2160 samples, the next interrupt will be canceled so the hardware counter could catch up a little.
[ 43.208299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 107520 hardware_counter 98880 pcm delay 8640 (in bytes) [ 43.208306] intel_sst_acpi 808622A8:00: buffer ptr 26880 pcm_delay rep: 2160 [ 43.208321] sound pcmC1D0p: [Q] pos 26880 hw_ptr 26880 appl_ptr 40000 avail 191680 => one interrupt is skipped. [ 43.218299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 108544 hardware_counter 100800 pcm delay 7744 (in bytes) [ 43.218307] intel_sst_acpi 808622A8:00: buffer ptr 27136 pcm_delay rep: 1936 [ 43.218336] sound pcmC1D0p: [Q] pos 27136 hw_ptr 27136 appl_ptr 40000 avail 191936
So I think why not using the hardware counter? It increases 240 samples every 5ms perfectly match the 48000 sample rate. The test result is good but I know there must be a reason for the original designer to use ring buffer counter instead of hardware counter. I uploaded this patch to see if anyone still remember the reason and share some insight with me.
I totally agree that we shouldn't touch this part of design. Do you think it make sense to add a constraint to enforce the period size in machine driver? If yes then I would upload patches for Chrome atom machines for google.
Regards, Brent
I'm curious why not just using hardware counter to update hw_ptr and get rid of the period setting in hw_param? It seems to me the ring buffer counter does not reflect the real status.
I don't recall precisely what this hardware counter does. I vaguely recall it's tied to the 19.2MHz external timer which is also used to schedule the 1ms SBA mixer and the SSP IOs. And by comparing with the ring buffer pointer you can infer the delay inside the DSP. I think you are also making an assumption that all streams are tied to the output rate, but that's most likely a bad assumption. The hard-coded topology supported media, speech and compressed data and the consumption rate on the DMA side could be faster with some buffering happening in the DSP. It's not a passthrough DMA in all cases.
This is really legacy code that no one really fully understands nor plans on improving, it'd be a bad idea to change the pcm pointer reports now, 6 years after the initial code release and after all initial contributors moved on. It's what it is.
On 7/28/20 12:02 PM, Lu, Brent wrote:
So if there are already quirks in atom machine drivers to change the period size, why is this patch necessary?
The story is: google implemented the constraint but doesn't know why it works so asked us to explain. After checking the two counters I realized the increase of ring buffer pointer follows the period size setting in hw_param (256) but the period of interrupt is always 5ms instead of 5.33 so it's running little bit too fast. It seems the LPE keeps tracking the difference of two counters. When the difference exceeds 2160 samples, the next interrupt will be canceled so the hardware counter could catch up a little.
[ 43.208299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 107520 hardware_counter 98880 pcm delay 8640 (in bytes) [ 43.208306] intel_sst_acpi 808622A8:00: buffer ptr 26880 pcm_delay rep: 2160 [ 43.208321] sound pcmC1D0p: [Q] pos 26880 hw_ptr 26880 appl_ptr 40000 avail 191680 => one interrupt is skipped. [ 43.218299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 108544 hardware_counter 100800 pcm delay 7744 (in bytes) [ 43.218307] intel_sst_acpi 808622A8:00: buffer ptr 27136 pcm_delay rep: 1936 [ 43.218336] sound pcmC1D0p: [Q] pos 27136 hw_ptr 27136 appl_ptr 40000 avail 191936
So I think why not using the hardware counter? It increases 240 samples every 5ms perfectly match the 48000 sample rate. The test result is good but I know there must be a reason for the original designer to use ring buffer counter instead of hardware counter. I uploaded this patch to see if anyone still remember the reason and share some insight with me.
I totally agree that we shouldn't touch this part of design. Do you think it make sense to add a constraint to enforce the period size in machine driver? If yes then I would upload patches for Chrome atom machines for google.
I think it'd make sense to add this constraint, either in the machine driver or in the platform driver, so that we don't change the position updates and introduce more issues by accident by doing so. As you rightly said, I don't think anyone tested periods multiple of 256 samples so it's not a regression, more aligning with the internal design.
On Wed, Jul 29, 2020 at 1:31 AM Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com wrote:
On 7/28/20 12:02 PM, Lu, Brent wrote:
So if there are already quirks in atom machine drivers to change the period size, why is this patch necessary?
The story is: google implemented the constraint but doesn't know why it works so asked us to explain. After checking the two counters I realized the increase of ring buffer pointer follows the period size setting in hw_param (256) but the period of interrupt is always 5ms instead of 5.33 so it's running little bit too fast. It seems the LPE keeps tracking the difference of two counters. When the difference exceeds 2160 samples, the next interrupt will be canceled so the hardware counter could catch up a little.
[ 43.208299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 107520 hardware_counter 98880 pcm delay 8640 (in bytes) [ 43.208306] intel_sst_acpi 808622A8:00: buffer ptr 26880 pcm_delay rep: 2160 [ 43.208321] sound pcmC1D0p: [Q] pos 26880 hw_ptr 26880 appl_ptr 40000 avail 191680 => one interrupt is skipped. [ 43.218299] intel_sst_acpi 808622A8:00: mrfld ring_buffer_counter 108544 hardware_counter 100800 pcm delay 7744 (in bytes) [ 43.218307] intel_sst_acpi 808622A8:00: buffer ptr 27136 pcm_delay rep: 1936 [ 43.218336] sound pcmC1D0p: [Q] pos 27136 hw_ptr 27136 appl_ptr 40000 avail 191936
So I think why not using the hardware counter? It increases 240 samples every 5ms perfectly match the 48000 sample rate. The test result is good but I know there must be a reason for the original designer to use ring buffer counter instead of hardware counter. I uploaded this patch to see if anyone still remember the reason and share some insight with me.
I totally agree that we shouldn't touch this part of design. Do you think it make sense to add a constraint to enforce the period size in machine driver? If yes then I would upload patches for Chrome atom machines for google.
I think it'd make sense to add this constraint, either in the machine driver or in the platform driver, so that we don't change the position updates and introduce more issues by accident by doing so. As you rightly said, I don't think anyone tested periods multiple of 256 samples so it's not a regression, more aligning with the internal design.
Thanks for the explanation and discussion. A slight correction in this thread is that CRAS does not set period size to 256. It just uses whatever value that driver wants to use, which happens to be 256. If the driver can limit the selection to only correct values that would be the best. Adding constraints in the machine driver or platform driver looks good. Prefer to adding in platform driver so we don't need to add for each machine driver. Thanks!
participants (4)
-
Brent Lu
-
Cheng-yi Chiang
-
Lu, Brent
-
Pierre-Louis Bossart