Re: [alsa-devel] kernel freeze regression relating to imx-sdma
Hi Lucas,
Thanks for the feedback.
-----Original Message-----
From: Lucas Stach [mailto:l.stach@pengutronix.de]
Sent: October 3, 2019 5:26 PM
To: Kevin Groeneveld; dmaengine@vger.kernel.org
Subject: Re: kernel freeze regression relating to imx-sdma
Hi Kevin,
Am Donnerstag, den 03.10.2019, 21:11 +0000 schrieb Kevin Groeneveld:
We have recently come across an issue which can cause the Linux kernel to freeze on an iMX6 based system. After bisecting, I found it is related to the following imx-sdma commits:
64068853bc77786d1a28abb4087d6a3e93aedbe2 dmaengine: imx-sdma: use GFP_NOWAIT for dma descriptor allocations
b8603d2a5795c42f78998e70dc792336e0dc20c9 dmaengine: imx-sdma: implement channel termination via worker
ebb853b1bd5f659b92c71dc6a9de44cfc37c78c0 Revert "dmaengine: imx-sdma: alloclate bd memory from dma pool"
c06abca69218ac42fa58d1ba7a7b0d9bab5f1b18 Revert "dmaengine: imx-sdma: Use GFP_NOWAIT for dma allocations"
If I revert these four commits the freeze does not happen. The issue is still present up to at least v5.3.
The problem seems to be related to overruns when doing an SPDIF capture, particularly when there is not actually a valid SPDIF input to the iMX6. When the issue occurs the following is (usually) printed on the console (using a v5.0 based kernel):
[ 1251.944877] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1251.950823] rcu: (detected by 0, t=2102 jiffies, g=220425, q=0)
[ 1251.956866] rcu: All QSes seen, last rcu_sched kthread activity 2102 (95117-93015), jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 1251.968266] arecord S 0 1916 1908 0x00000002
[ 1251.973816] [<8010dcec>] (unwind_backtrace) from [<8010a4c4>] (show_stack+0x10/0x14)
[ 1251.981596] [<8010a4c4>] (show_stack) from [<8016b314>] (rcu_check_callbacks+0x57c/0x5d8)
[ 1251.989806] [<8016b314>] (rcu_check_callbacks) from [<8016fe84>] (update_process_times+0x30/0x5c)
[ 1251.998709] [<8016fe84>] (update_process_times) from [<8017e960>] (tick_sched_timer+0x44/0x94)
[ 1252.007353] [<8017e960>] (tick_sched_timer) from [<80170a84>] (__hrtimer_run_queues+0xd8/0x160)
[ 1252.016081] [<80170a84>] (__hrtimer_run_queues) from [<8017103c>] (hrtimer_interrupt+0xdc/0x284)
[ 1252.024891] [<8017103c>] (hrtimer_interrupt) from [<8010d6c0>] (twd_handler+0x2c/0x38)
[ 1252.032839] [<8010d6c0>] (twd_handler) from [<80160b34>] (handle_percpu_devid_irq+0x70/0x11c)
[ 1252.041401] [<80160b34>] (handle_percpu_devid_irq) from [<8015c234>] (generic_handle_irq+0x18/0x28)
[ 1252.050480] [<8015c234>] (generic_handle_irq) from [<8015c814>] (__handle_domain_irq+0xa0/0xb4)
[ 1252.059211] [<8015c814>] (__handle_domain_irq) from [<8039098c>] (gic_handle_irq+0x58/0x90)
[ 1252.067590] [<8039098c>] (gic_handle_irq) from [<80101a0c>] (__irq_svc+0x6c/0x90)
[ 1252.075085] Exception stack(0x848afe30 to 0x848afe78)
[ 1252.080155] fe20: 8decf200 00000000 00000000 80b84ac8
[ 1252.088354] fe40: 00000000 8decf200 892f7800 00001589 892f7800 00000000 00000000 848ae000
[ 1252.096549] fe60: 8d8cf000 848afe80 805721e8 8056aae4 60000013 ffffffff
[ 1252.103200] [<80101a0c>] (__irq_svc) from [<8056aae4>] (__snd_pcm_stream_unlock_mode+0xc0/0xd8)
[ 1252.111925] [<8056aae4>] (__snd_pcm_stream_unlock_mode) from [<00000000>] ( (null))
[ 1252.119695] rcu: rcu_sched kthread starved for 2102 jiffies! g220425 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 1252.129877] rcu: RCU grace-period kthread stack dump:
[ 1252.134938] rcu_sched R running task 0 10 2 0x00000000
[ 1252.142034] [<806eee5c>] (__schedule) from [<806eef80>] (schedule+0x90/0xa0)
[ 1252.149116] [<806eef80>] (schedule) from [<806f2244>] (schedule_timeout+0x1f0/0x238)
[ 1252.156891] [<806f2244>] (schedule_timeout) from [<8016940c>] (rcu_gp_kthread+0x550/0x8f0)
[ 1252.165183] [<8016940c>] (rcu_gp_kthread) from [<8013a358>] (kthread+0x110/0x128)
[ 1252.172693] [<8013a358>] (kthread) from [<801010e8>] (ret_from_fork+0x14/0x2c)
[ 1252.179925] Exception stack(0x8f877fb0 to 0x8f877ff8)
[ 1252.184991] 7fa0: 00000000 00000000 00000000 00000000
[ 1252.193187] 7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1252.201380] 7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
I suspect the issue may not actually be a bug with imx-sdma but a race condition elsewhere that is exposed by the four mentioned commits.
I did notice that when an SPDIF overrun occurs dmaengine_terminate_async is called but dmaengine_synchronize is not. Should these calls always be paired? Or is it okay to call dmaengine_terminate_async without dmaengine_synchronize in some cases?
Yes, it seems this uncovered a bug in the SPDIF driver. Before those commits the dmaengine_terminate_async in the SDMA driver wasn't really async; now it is. The API requires that you eventually call dmaengine_synchronize after the terminate_async, before starting any new operations on the dmaengine. It doesn't necessarily need to be in the same codepath, but the async terminate always needs to be paired up with a synchronize.
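For anyone following along, here is a minimal sketch of the pairing the API expects. dmaengine_terminate_async() and dmaengine_synchronize() are the real dmaengine calls; the surrounding helper functions are made up purely for illustration and are not taken from any driver:

#include <linux/dmaengine.h>

/* Stop path: terminate_async() may return before the transfer and
 * its completion callback have actually finished. */
static void example_stop(struct dma_chan *chan)
{
	dmaengine_terminate_async(chan);
}

/* Restart path: the earlier terminate must be paired with a
 * synchronize (from sleepable context) before any new descriptors
 * are prepared or submitted on the same channel. */
static void example_restart(struct dma_chan *chan)
{
	dmaengine_synchronize(chan);

	/* ... now it is safe to prep/submit new work ... */
}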
I don't think the bug is limited to SPDIF. The calls to dmaengine_terminate_async are in sound/core/pcm_dmaengine.c. When the ALSA device is closed it does call dmaengine_synchronize, but when there is an overrun it does not. Commit bc0e7345168c0f7483d2d1da86285d89136417cd (by Lars-Peter Clausen) added the call to dmaengine_synchronize for the ALSA device close path. It seems the overrun case was overlooked? I think maybe I should copy the ALSA list on this as well.
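To make the asymmetry concrete, this is roughly the shape of the two paths as I read them (a simplified sketch, not verbatim pcm_dmaengine.c; 'chan' stands in for the substream's DMA channel pointer):

#include <linux/dmaengine.h>

/* Overrun/stop path: only an async terminate is issued, and nothing
 * waits for it to finish before the stream can be prepared and
 * started again. */
static void sketch_pcm_stop(struct dma_chan *chan)
{
	dmaengine_terminate_async(chan);
}

/* Close path: this is where the synchronize was added by the commit
 * mentioned above, pairing with the terminate issued earlier. */
static void sketch_pcm_close(struct dma_chan *chan)
{
	dmaengine_synchronize(chan);
}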
I have also found another manifestation of the imx-sdma change which may be useful for others trying to debug this. If I run the following command:
arecord -Dhw:4,0 -fS24_LE -c2 -r44100 --buffer-size 100 > /dev/null
With kernel v4.20-rc5 I get an endless stream of overrun messages. With kernel v4.20-rc6 (which includes the four imx-sdma commits mentioned above) I get a few overrun messages, but then arecord exits with "arecord: pcm_read:2096: read error: Input/output error".
Best regards, Kevin