[RFC PATCH v2 0/5] ASoC: soc-pcm: fix trigger race conditions with shared BE

Tue Oct 5 08:36:54 CEST 2021

On 10/5/2021 4:24 AM, Pierre-Louis Bossart wrote:
> External email: Use caution opening links or attachments
>
>
> We've been adding a 'deep buffer' PCM device to several SOF topologies
> in order to reduce power consumption. The typical use-case would be
> music playback over a headset: this additional PCM device provides
> more buffering and longer latencies, leaving the rest of the system
> sleep for longer periods. Notifications and 'regular' low-latency
> audio playback would still use the 'normal' PCM device and be mixed
> with the 'deep buffer' before rendering on the headphone endpoint. The
> tentative direction would be to expose this alternate device to
> PulseAudio/PipeWire/CRAS via the UCM SectionModifier definitions.
>
> That seemed a straightforward topology change until our automated
> validation stress tests started reporting issues on SoundWire
> platforms, when e.g. two START triggers might be send and conversely
> the STOP trigger is never sent. The SoundWire stream state management
> flagged inconsistent states when the two 'normal' and 'deep buffer'
> devices are used concurrently with rapid play/stop/pause monkey
> testing.
>
> Looking at the soc-pcm.c code, it seems that the BE state
> management needs a lot of love.
>
> a) there is no consistent protection for the BE state. In some parts
> of the code, the state updates are protected by a spinlock but in the
> trigger they are not. When we open/play/close the two PCM devices in
> stress tests, we end-up testing a state that is being modified. That
> can't be good.
>
> b) there is a conceptual deadlock: on stop we check the FE states to
> see if a shared BE can be stopped, but since we trigger the BE first
> the FE states have not been modified yet, so the TRIGGER_STOP is never
> sent.
>
> This patchset suggests a transition from a spinlock to a mutex, an
> extended protection when walking through the BE list, and the use of a
> refcount to decide when to trigger the BE. With these patches I am
> able to run our entire validation suite without any issues with this
> new 'deep buffer' topology, and no regressions on existing solutions [1]
>
> One might ask 'how come we didn't see this earlier'? The answer is
> probably that the .trigger callbacks in most implementations seems to
> perform DAPM operations, and sending the triggers multiple times is
> not an issue. In the case of SoundWire, we do use the .trigger
> callback to reconfigure the bus using the 'bank switch' mechanism. It
> could be acceptable to tolerate a trigger multiple times, but the
> deadlock on stop cannot be fixed at the SoundWire layer alone.
>
> I chose to send this patchset as an RFCv2 to gather more feedback and
> make use others know about DPCM issues. We're going to spend more time
> on this but if others can provide feedback/test results it would be
> greatly appreciated.
>
> Opens:
>
> 1) is this the right solution? The DPCM code is far from simple, has
> notions such as SND_SOC_DPCM_UPDATE_NO and 'trigger_pending' that I
> have no background on.
>
> 2) There are other reports of kernel oopses [2] that seem related to
> the lack of protection. I'd be good to confirm if this patchset solve
> these problems as well.
>
> [1] https://github.com/thesofproject/linux/pull/3146
> [2] https://lore.kernel.org/alsa-devel/002f01d7b4f5$c030f4a0$4092dde0$@samsung.com/
>
> changes since RFC v1:
> Removed unused function
> Removed exported symbols only used in soc-pcm.c, used static instead
> Use a mutex instead of a spinlock
> Protect all for_each_dpcm_be() loops
> Fix bugs introduced in the refcount

> Pierre-Louis Bossart (5):
>    ASoC: soc-pcm: remove snd_soc_dpcm_fe_can_update()
>    ASoC: soc-pcm: don't export local functions, use static
>    ASoC: soc-pcm: replace dpcm_lock with dpcm_mutex
>    ASoC: soc-pcm: protect for_each_dpcm_be() loops with dpcm_mutex
>    ASoC: soc-pcm: test refcount before triggering

Thank you Pierre-Louis for adding me to this thread.

I did a quick test of your patches on my Tegra board and seeing issues 
with multiple streams. For instance, I ran it for a 2x1 mixer 
configuration and hitting below:

[  277.661886] BUG: scheduling while atomic: aplay/1306/0x00000100
[  287.713193] BUG: spinlock cpu recursion on CPU#0, aplay/1307
[  287.719138]  lock: 0xffff00008cc820f0, .magic: dead4ead, .owner: 
aplay/1306, .owner_cpu: 0
[  287.727319] CPU: 0 PID: 1307 Comm: aplay Tainted: G W         
5.15.0-rc3-next-20210927-00026-gffdabce987b1 #12
[  287.737783] Hardware name: NVIDIA Jetson AGX Xavier Developer Kit (DT)
[  287.744228] Call trace:
[  287.746656]  dump_backtrace+0x0/0x1c0
[  287.750300]  show_stack+0x18/0x28
[  287.753604]  dump_stack_lvl+0x7c/0xa8
[  287.757236]  dump_stack+0x18/0x34
[  287.760536]  spin_dump+0x70/0x90
[  287.763732]  do_raw_spin_lock+0xd8/0x120
[  287.767615]  _raw_spin_lock_irq+0x60/0x80
[  287.771581]  snd_pcm_stream_lock_irq+0x20/0x48 [snd_pcm]
[  287.776853]  snd_pcm_drain+0x1ec/0x348 [snd_pcm]
[  287.781421]  snd_pcm_common_ioctl+0xacc/0x1938 [snd_pcm]
[  287.786685]  snd_pcm_ioctl+0x2c/0x48 [snd_pcm]
[  287.791101]  __arm64_sys_ioctl+0xb0/0xf0
[  287.794982]  invoke_syscall+0x44/0x108
[  287.798683]  el0_svc_common.constprop.3+0x74/0x100
[  287.803416]  do_el0_svc+0x24/0x90
[  287.806687]  el0_svc+0x20/0x60
[  287.809705]  el0t_64_sync_handler+0x94/0xb8
[  287.813839]  el0t_64_sync+0x180/0x184

And in some case just below:

[ 1074.212276] BUG: scheduling while atomic: aplay/12327/0x00000100
[ 1095.227509] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1095.233443] rcu:     0-...0: (1 GPs behind) 
idle=4af/1/0x4000000000000004 softirq=19902/19902 fqs=2626
[ 1095.242528] rcu:     2-...0: (1 GPs behind) 
idle=9d5/1/0x4000000000000000 softirq=22707/22707 fqs=262