On 10/5/2021 4:24 AM, Pierre-Louis Bossart wrote:
External email: Use caution opening links or attachments
We've been adding a 'deep buffer' PCM device to several SOF topologies in order to reduce power consumption. The typical use-case would be music playback over a headset: this additional PCM device provides more buffering and longer latencies, leaving the rest of the system sleep for longer periods. Notifications and 'regular' low-latency audio playback would still use the 'normal' PCM device and be mixed with the 'deep buffer' before rendering on the headphone endpoint. The tentative direction would be to expose this alternate device to PulseAudio/PipeWire/CRAS via the UCM SectionModifier definitions.
That seemed a straightforward topology change until our automated validation stress tests started reporting issues on SoundWire platforms, when e.g. two START triggers might be send and conversely the STOP trigger is never sent. The SoundWire stream state management flagged inconsistent states when the two 'normal' and 'deep buffer' devices are used concurrently with rapid play/stop/pause monkey testing.
Looking at the soc-pcm.c code, it seems that the BE state management needs a lot of love.
a) there is no consistent protection for the BE state. In some parts of the code, the state updates are protected by a spinlock but in the trigger they are not. When we open/play/close the two PCM devices in stress tests, we end-up testing a state that is being modified. That can't be good.
b) there is a conceptual deadlock: on stop we check the FE states to see if a shared BE can be stopped, but since we trigger the BE first the FE states have not been modified yet, so the TRIGGER_STOP is never sent.
This patchset suggests a transition from a spinlock to a mutex, an extended protection when walking through the BE list, and the use of a refcount to decide when to trigger the BE. With these patches I am able to run our entire validation suite without any issues with this new 'deep buffer' topology, and no regressions on existing solutions [1]
One might ask 'how come we didn't see this earlier'? The answer is probably that the .trigger callbacks in most implementations seems to perform DAPM operations, and sending the triggers multiple times is not an issue. In the case of SoundWire, we do use the .trigger callback to reconfigure the bus using the 'bank switch' mechanism. It could be acceptable to tolerate a trigger multiple times, but the deadlock on stop cannot be fixed at the SoundWire layer alone.
I chose to send this patchset as an RFCv2 to gather more feedback and make use others know about DPCM issues. We're going to spend more time on this but if others can provide feedback/test results it would be greatly appreciated.
Opens:
- is this the right solution? The DPCM code is far from simple, has
notions such as SND_SOC_DPCM_UPDATE_NO and 'trigger_pending' that I have no background on.
- There are other reports of kernel oopses [2] that seem related to
the lack of protection. I'd be good to confirm if this patchset solve these problems as well.
[1] https://github.com/thesofproject/linux/pull/3146 [2] https://lore.kernel.org/alsa-devel/002f01d7b4f5$c030f4a0$4092dde0$@samsung.c...
changes since RFC v1: Removed unused function Removed exported symbols only used in soc-pcm.c, used static instead Use a mutex instead of a spinlock Protect all for_each_dpcm_be() loops Fix bugs introduced in the refcount
Pierre-Louis Bossart (5): ASoC: soc-pcm: remove snd_soc_dpcm_fe_can_update() ASoC: soc-pcm: don't export local functions, use static ASoC: soc-pcm: replace dpcm_lock with dpcm_mutex ASoC: soc-pcm: protect for_each_dpcm_be() loops with dpcm_mutex ASoC: soc-pcm: test refcount before triggering
Thank you Pierre-Louis for adding me to this thread.
I did a quick test of your patches on my Tegra board and seeing issues with multiple streams. For instance, I ran it for a 2x1 mixer configuration and hitting below:
[ 277.661886] BUG: scheduling while atomic: aplay/1306/0x00000100 [ 287.713193] BUG: spinlock cpu recursion on CPU#0, aplay/1307 [ 287.719138] lock: 0xffff00008cc820f0, .magic: dead4ead, .owner: aplay/1306, .owner_cpu: 0 [ 287.727319] CPU: 0 PID: 1307 Comm: aplay Tainted: G W 5.15.0-rc3-next-20210927-00026-gffdabce987b1 #12 [ 287.737783] Hardware name: NVIDIA Jetson AGX Xavier Developer Kit (DT) [ 287.744228] Call trace: [ 287.746656] dump_backtrace+0x0/0x1c0 [ 287.750300] show_stack+0x18/0x28 [ 287.753604] dump_stack_lvl+0x7c/0xa8 [ 287.757236] dump_stack+0x18/0x34 [ 287.760536] spin_dump+0x70/0x90 [ 287.763732] do_raw_spin_lock+0xd8/0x120 [ 287.767615] _raw_spin_lock_irq+0x60/0x80 [ 287.771581] snd_pcm_stream_lock_irq+0x20/0x48 [snd_pcm] [ 287.776853] snd_pcm_drain+0x1ec/0x348 [snd_pcm] [ 287.781421] snd_pcm_common_ioctl+0xacc/0x1938 [snd_pcm] [ 287.786685] snd_pcm_ioctl+0x2c/0x48 [snd_pcm] [ 287.791101] __arm64_sys_ioctl+0xb0/0xf0 [ 287.794982] invoke_syscall+0x44/0x108 [ 287.798683] el0_svc_common.constprop.3+0x74/0x100 [ 287.803416] do_el0_svc+0x24/0x90 [ 287.806687] el0_svc+0x20/0x60 [ 287.809705] el0t_64_sync_handler+0x94/0xb8 [ 287.813839] el0t_64_sync+0x180/0x184
And in some case just below:
[ 1074.212276] BUG: scheduling while atomic: aplay/12327/0x00000100 [ 1095.227509] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 1095.233443] rcu: 0-...0: (1 GPs behind) idle=4af/1/0x4000000000000004 softirq=19902/19902 fqs=2626 [ 1095.242528] rcu: 2-...0: (1 GPs behind) idle=9d5/1/0x4000000000000000 softirq=22707/22707 fqs=262