Context expectations in ALSA
Hi!
We currently have an issue reported by lockdep on the RaspberryPi and its HDMI audio output where, at startup, we end up scheduling in atomic context.
This is caused by the HDMI driver polling some status bit that reports that the infoframes have been properly sent, and calling usleep_range between each iteration[1], and that is done in our trigger callback that seems to be run with a spinlock taken and the interrupt disabled (snd_pcm_action_lock_irq) as part of snd_pcm_start_lock_irq. This is the entire stack trace:
# aplay --channels=2 -D 'iec958:CARD=vc4hdmi0,DEV=0' /root/test-tone.wav Playing WAVE '/root/test-tone.wav' : Signed 16 bit Little Endian, Rate 44100 Hz, Stereo [ 14.732730] BUG: sleeping function called from invalid context at drivers/gpu/drm/vc4/vc4_hdmi.c:276 [ 14.742005] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 140, name: aplay [ 14.749955] CPU: 0 PID: 140 Comm: aplay Not tainted 5.9.0-rc5-v7l+ #66 [ 14.756578] Hardware name: BCM2711 [ 14.760026] Backtrace: [ 14.762524] [<c020d250>] (dump_backtrace) from [<c020d5f4>] (show_stack+0x20/0x24) [ 14.770209] r7:c167c9e4 r6:600e0093 r5:00000000 r4:c167c9e4 [ 14.775960] [<c020d5d4>] (show_stack) from [<c0703164>] (dump_stack+0xb8/0xd8) [ 14.783297] [<c07030ac>] (dump_stack) from [<c025108c>] (___might_sleep+0x138/0x17c) [ 14.791159] r10:00000084 r9:f0979b00 r8:c10ffd50 r7:00000010 r6:00000000 r5:00000114 [ 14.799105] r4:c10fe000 r3:600e0093 [ 14.802736] [<c0250f54>] (___might_sleep) from [<c0251140>] (__might_sleep+0x70/0xb0) [ 14.810680] r4:c0e8f5f4 [ 14.813256] [<c02510d0>] (__might_sleep) from [<c0840724>] (vc4_hdmi_stop_packet+0x120/0x330) [ 14.821907] r6:ef21f280 r5:00000003 r4:73e746f9 [ 14.826598] [<c0840604>] (vc4_hdmi_stop_packet) from [<c084422c>] (vc4_hdmi_write_infoframe+0x140/0x5d0) [ 14.836220] r9:f0979b00 r8:c10ffd50 r7:ef219900 r6:ef248840 r5:00000000 r4:ef21f280 [ 14.844084] [<c08440ec>] (vc4_hdmi_write_infoframe) from [<c0844718>] (vc4_hdmi_set_audio_infoframe+0x5c/0x80) [ 14.854236] r10:efac7738 r9:ef3ea240 r8:00000001 r7:ef219900 r6:ef248840 r5:ef21f040 [ 14.862180] r4:ef21f280 [ 14.864756] [<c08446bc>] (vc4_hdmi_set_audio_infoframe) from [<c0844774>] (vc4_hdmi_audio_trigger+0x38/0x238) [ 14.874815] r4:00000001 [ 14.877392] [<c084473c>] (vc4_hdmi_audio_trigger) from [<c0a155f0>] (snd_soc_pcm_dai_trigger+0x64/0xcc) [ 14.886923] r5:efac5a00 r4:00000000 [ 14.890555] [<c0a1558c>] (snd_soc_pcm_dai_trigger) from [<c0a18c70>] (soc_pcm_trigger+0x70/0xac) [ 14.899472] r9:ef3ea240 r8:c1133540 r7:00000003 r6:ef219900 r5:ef219900 r4:00000001 [ 14.907336] [<c0a18c00>] (soc_pcm_trigger) from [<c09fbd84>] (snd_pcm_do_start+0x3c/0x40) [ 14.915633] r5:c0c749dc r4:00000000 [ 14.919264] [<c09fbd48>] (snd_pcm_do_start) from [<c09fa390>] (snd_pcm_action_single+0x48/0x88) [ 14.928094] [<c09fa348>] (snd_pcm_action_single) from [<c09fc830>] (snd_pcm_action+0x6c/0x78) [ 14.936746] r7:00000003 r6:c0c749dc r5:00000000 r4:ef219900 [ 14.942493] [<c09fc7c4>] (snd_pcm_action) from [<c09fd7ec>] (snd_pcm_action_lock_irq+0x38/0x50) [ 14.951320] r7:00000030 r6:00000003 r5:c0c749dc r4:ef219900 [ 14.957069] [<c09fd7b4>] (snd_pcm_action_lock_irq) from [<c09ffd2c>] (snd_pcm_ioctl+0x930/0x14e4) [ 14.966073] r7:00000030 r6:00000000 r5:0085ee60 r4:ef219900 [ 14.971823] [<c09ff3fc>] (snd_pcm_ioctl) from [<c043accc>] (sys_ioctl+0xe4/0x9e4) [ 14.979420] r10:efac7738 r9:00000004 r8:c1133540 r7:0085ee60 r6:00005624 r5:c1133540 [ 14.987364] r4:00004142 [ 14.989940] [<c043abe8>] (sys_ioctl) from [<c0200040>] (ret_fast_syscall+0x0/0x28) [ 14.997621] Exception stack(0xc10fffa8 to 0xc10ffff0) [ 15.002749] ffa0: 00869260 b6fa8000 00000004 00004142 0085ee60 00000001 [ 15.011050] ffc0: 00869260 b6fa8000 00005624 00000036 00000000 00000000 00001589 00001589 [ 15.019350] ffe0: b6fa8504 bedb59fc b6f2ff68 b6da515c [ 15.024479] r10:00000036 r9:c10fe000 r8:c0200204 r7:00000036 r6:00005624 r5:b6fa8000 [ 15.032423] r4:00869260
We could switch to a udelay instead, but the idea of busy-waiting for up to a 100ms while having the interrupt disabled doesn't sound ideal either.
It looks like the snd_soc_dai_link structure has a nonatomic flag that seems to be made to address more or less that issue, taking a mutex instead of a spinlock. However setting that flag results in another lockdep issue, since the dmaengine controller doing the DMA transfer would call snd_pcm_period_elapsed on completion, in a tasklet, this time taking a mutex in an atomic context which is just as bad as the initial issue. This is the stacktrace this time:
# aplay --channels=2 -D 'iec958:CARD=vc4hdmi0,DEV=0' /root/test-tone.wav Playing WAVE '/root/test-tone.wav' : Signed 16 bit Little Endian, Rate 44100 Hz, Stereo [ 43.078900] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281 [ 43.087485] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/0 [ 43.095452] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.0-rc5-v7l+ #67 [ 43.102252] Hardware name: BCM2711 [ 43.105701] Backtrace: [ 43.108199] [<c020d250>] (dump_backtrace) from [<c020d5f4>] (show_stack+0x20/0x24) [ 43.115884] r7:c167c9e4 r6:600e0113 r5:00000000 r4:c167c9e4 [ 43.121636] [<c020d5d4>] (show_stack) from [<c0703164>] (dump_stack+0xb8/0xd8) [ 43.128972] [<c07030ac>] (dump_stack) from [<c025108c>] (___might_sleep+0x138/0x17c) [ 43.136833] r10:c1600000 r9:00000006 r8:00000000 r7:c1601d70 r6:00000000 r5:00000119 [ 43.144777] r4:c1600000 r3:600e0113 [ 43.148409] [<c0250f54>] (___might_sleep) from [<c0251140>] (__might_sleep+0x70/0xb0) [ 43.156353] r4:c0e32994 [ 43.158930] [<c02510d0>] (__might_sleep) from [<c0baf508>] (mutex_lock+0x2c/0x74) [ 43.166524] r6:efbfaa84 r5:ef26f548 r4:ef219998 [ 43.171216] [<c0baf4dc>] (mutex_lock) from [<c09fa82c>] (_snd_pcm_stream_lock_irqsave+0x2c/0x40) [ 43.180130] r5:ef26f548 r4:ef219900 [ 43.183765] [<c09fa800>] (_snd_pcm_stream_lock_irqsave) from [<c0a03848>] (snd_pcm_period_elapsed+0x20/0xa4) [ 43.193743] [<c0a03828>] (snd_pcm_period_elapsed) from [<c0a067d4>] (dmaengine_pcm_dma_complete+0x54/0x58) [ 43.203539] r7:c1601d70 r6:efbfaa84 r5:ef26f548 r4:00000000 [ 43.209290] [<c0a06780>] (dmaengine_pcm_dma_complete) from [<c07885e8>] (vchan_complete+0x238/0x240) [ 43.218567] [<c07883b0>] (vchan_complete) from [<c0228f9c>] (tasklet_action_common.constprop.0+0x84/0x12c) [ 43.228365] r10:c1600000 r9:00000006 r8:00000000 r7:dac802e0 r6:efbfaa84 r5:00000000 [ 43.236309] r4:efbfaa80 [ 43.238886] [<c0228f18>] (tasklet_action_common.constprop.0) from [<c022906c>] (tasklet_action+0x28/0x30) [ 43.248596] r9:00000040 r8:ef80b800 r7:00000101 r6:00000006 r5:00000007 r4:c1603098 [ 43.256459] [<c0229044>] (tasklet_action) from [<c0201284>] (__do_softirq+0x18c/0x47c) [ 43.264498] [<c02010f8>] (__do_softirq) from [<c0228de4>] (irq_exit+0xd0/0xf8) [ 43.271829] r10:c1051f40 r9:c1600000 r8:ef80b800 r7:00000001 r6:00000000 r5:00000000 [ 43.279774] r4:ffffe000 [ 43.282349] [<c0228d14>] (irq_exit) from [<c0287c54>] (__handle_domain_irq+0x70/0xc0) [ 43.290294] r5:00000000 r4:c148ee80 [ 43.293927] [<c0287be4>] (__handle_domain_irq) from [<c071d8b4>] (gic_handle_irq+0x4c/0x88) [ 43.302403] r9:c1600000 r8:c1601ed0 r7:f0803000 r6:f0802000 r5:f080200c r4:c16058d8 [ 43.310266] [<c071d868>] (gic_handle_irq) from [<c0200abc>] (__irq_svc+0x5c/0x7c) [ 43.317856] Exception stack(0xc1601ed0 to 0xc1601f18) [ 43.322983] 1ec0: c0208b84 00000000 19bf8000 600e0093 [ 43.331283] 1ee0: c1600000 00000000 c1604e34 c1604e7c c16d20e5 c0e30cac c1051f40 c1601f2c [ 43.339582] 1f00: ffffe000 c1601f20 c1600000 c0208b88 600e0013 ffffffff [ 43.346297] r9:c1600000 r8:c16d20e5 r7:c1601f04 r6:ffffffff r5:600e0013 r4:c0208b88 [ 43.354164] [<c0208b54>] (arch_cpu_idle) from [<c0bb23c8>] (default_idle_call+0x3c/0x184) [ 43.362469] [<c0bb238c>] (default_idle_call) from [<c025b5f8>] (do_idle+0x110/0x170) [ 43.370327] r5:00000000 r4:c1600000 [ 43.373958] [<c025b4e8>] (do_idle) from [<c025b950>] (cpu_startup_entry+0x28/0x2c) [ 43.381641] r9:00000193 r8:effff900 r7:c1604e00 r6:00000000 r5:c160ec50 r4:000000d7 [ 43.389503] [<c025b928>] (cpu_startup_entry) from [<c0baa65c>] (rest_init+0x108/0x110) [ 43.397542] [<c0baa554>] (rest_init) from [<c1000cc8>] (arch_call_rest_init+0x18/0x1c) [ 43.405575] r5:c1051f40 r4:c16e7078 [ 43.409205] [<c1000cb0>] (arch_call_rest_init) from [<c1001538>] (start_kernel+0x7f4/0x834) [ 43.417682] [<c1000d44>] (start_kernel) from [<00000000>] (0x0)
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
Thanks! Maxime
1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
Dne 22. 10. 20 v 11:50 Maxime Ripard napsal(a):
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
I think that you should set the non-atomic flag and wake up the workqueue or so from interrupt handler in this case. Call snd_pcm_period_elapsed() from the workqueue not the interrupt handler context.
Jaroslav
On Thu, Oct 22, 2020 at 12:03:19PM +0200, Jaroslav Kysela wrote:
Dne 22. 10. 20 v 11:50 Maxime Ripard napsal(a):
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
I think that you should set the non-atomic flag and wake up the workqueue or so from interrupt handler in this case. Call snd_pcm_period_elapsed() from the workqueue not the interrupt handler context.
Yeah, that was my first guess too. However, the DMA driver uses some kind of generic helpers using a tasklet, so getting rid of it would take some work and would very likely not be eligible for stable.
Maxime
On Thu, 22 Oct 2020 14:57:41 +0200, Maxime Ripard wrote:
On Thu, Oct 22, 2020 at 12:03:19PM +0200, Jaroslav Kysela wrote:
Dne 22. 10. 20 v 11:50 Maxime Ripard napsal(a):
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
I think that you should set the non-atomic flag and wake up the workqueue or so from interrupt handler in this case. Call snd_pcm_period_elapsed() from the workqueue not the interrupt handler context.
Yeah, that was my first guess too. However, the DMA driver uses some kind of generic helpers using a tasklet, so getting rid of it would take some work and would very likely not be eligible for stable.
Who sets the nonatomic flag for vc4? I couldn't find the relevant code in the latest upstream.
Ideally dmaengine PCM helper should support the nonatomic mode, but until then, the other side needs to drop the nonatomic flag, I suppose.
thanks,
Takashi
Hi Takashi,
On Thu, Oct 22, 2020 at 03:20:49PM +0200, Takashi Iwai wrote:
On Thu, 22 Oct 2020 14:57:41 +0200, Maxime Ripard wrote:
On Thu, Oct 22, 2020 at 12:03:19PM +0200, Jaroslav Kysela wrote:
Dne 22. 10. 20 v 11:50 Maxime Ripard napsal(a):
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
I think that you should set the non-atomic flag and wake up the workqueue or so from interrupt handler in this case. Call snd_pcm_period_elapsed() from the workqueue not the interrupt handler context.
Yeah, that was my first guess too. However, the DMA driver uses some kind of generic helpers using a tasklet, so getting rid of it would take some work and would very likely not be eligible for stable.
Who sets the nonatomic flag for vc4? I couldn't find the relevant code in the latest upstream.
Sorry if this wasn't clear enough, it's not there at the moment, ALSA takes a spinlock and lockdep complains that we're sleeping in an atomic context.
I tried to add the nonatomic flag in my tree to see if it was fixing the issue, but ran into another lockdep complain now with ALSA taking a mutex in a tasklet.
Ideally dmaengine PCM helper should support the nonatomic mode, but until then, the other side needs to drop the nonatomic flag, I suppose.
In this case, I'm not sure the blame is in the PCM helper but if there's any blame, I guess it's the virt-chan layer inside dmaengine (so for providers) that use a tasklet instead of something that allows sleeping
Maxime
On Thu, 22 Oct 2020 15:24:12 +0200, Maxime Ripard wrote:
Hi Takashi,
On Thu, Oct 22, 2020 at 03:20:49PM +0200, Takashi Iwai wrote:
On Thu, 22 Oct 2020 14:57:41 +0200, Maxime Ripard wrote:
On Thu, Oct 22, 2020 at 12:03:19PM +0200, Jaroslav Kysela wrote:
Dne 22. 10. 20 v 11:50 Maxime Ripard napsal(a):
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
I think that you should set the non-atomic flag and wake up the workqueue or so from interrupt handler in this case. Call snd_pcm_period_elapsed() from the workqueue not the interrupt handler context.
Yeah, that was my first guess too. However, the DMA driver uses some kind of generic helpers using a tasklet, so getting rid of it would take some work and would very likely not be eligible for stable.
Who sets the nonatomic flag for vc4? I couldn't find the relevant code in the latest upstream.
Sorry if this wasn't clear enough, it's not there at the moment, ALSA takes a spinlock and lockdep complains that we're sleeping in an atomic context.
I tried to add the nonatomic flag in my tree to see if it was fixing the issue, but ran into another lockdep complain now with ALSA taking a mutex in a tasklet.
I see, thanks for clarification.
Ideally dmaengine PCM helper should support the nonatomic mode, but until then, the other side needs to drop the nonatomic flag, I suppose.
In this case, I'm not sure the blame is in the PCM helper but if there's any blame, I guess it's the virt-chan layer inside dmaengine (so for providers) that use a tasklet instead of something that allows sleeping
Well, we have to align either to atomic or non-atomic operation. If we want to solve solely in vc4, the fix would be to make the trigger action into some own work.
Takashi
On Thu, Oct 22, 2020 at 11:50:41AM +0200, Maxime Ripard wrote:
This is caused by the HDMI driver polling some status bit that reports that the infoframes have been properly sent, and calling usleep_range between each iteration[1], and that is done in our trigger callback that seems to be run with a spinlock taken and the interrupt disabled (snd_pcm_action_lock_irq) as part of snd_pcm_start_lock_irq. This is the entire stack trace:
That doesn't sound like something I would expect you do be doing in the trigger callback TBH - it feels like if this is something that could block then the setup should have been done during parameter configuration or something rather than in trigger.
It looks like the snd_soc_dai_link structure has a nonatomic flag that seems to be made to address more or less that issue, taking a mutex instead of a spinlock. However setting that flag results in another lockdep issue, since the dmaengine controller doing the DMA transfer would call snd_pcm_period_elapsed on completion, in a tasklet, this time taking a mutex in an atomic context which is just as bad as the initial issue. This is the stacktrace this time:
Like Jaroslav says you could punt to a workqueue here. I'd be more inclined to move the sleeping stuff out of the trigger operations but that'd avoid the issue too. There are some drivers doing this already IIRC.
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
To me having something in the trigger that needs waiting for is the bit that feels the most awkward fit here, trigger is supposed to run very quickly.
Hi Mark
On Thu, Oct 22, 2020 at 02:50:53PM +0100, Mark Brown wrote:
On Thu, Oct 22, 2020 at 11:50:41AM +0200, Maxime Ripard wrote:
This is caused by the HDMI driver polling some status bit that reports that the infoframes have been properly sent, and calling usleep_range between each iteration[1], and that is done in our trigger callback that seems to be run with a spinlock taken and the interrupt disabled (snd_pcm_action_lock_irq) as part of snd_pcm_start_lock_irq. This is the entire stack trace:
That doesn't sound like something I would expect you do be doing in the trigger callback TBH - it feels like if this is something that could block then the setup should have been done during parameter configuration or something rather than in trigger.
It looks like the snd_soc_dai_link structure has a nonatomic flag that seems to be made to address more or less that issue, taking a mutex instead of a spinlock. However setting that flag results in another lockdep issue, since the dmaengine controller doing the DMA transfer would call snd_pcm_period_elapsed on completion, in a tasklet, this time taking a mutex in an atomic context which is just as bad as the initial issue. This is the stacktrace this time:
Like Jaroslav says you could punt to a workqueue here. I'd be more inclined to move the sleeping stuff out of the trigger operations but that'd avoid the issue too. There are some drivers doing this already IIRC.
So, I'm not really sure what I'm supposed to do here. The drivers involved don't appear to be doing anything extraordinary, but the issues lockdep report are definitely valid too. What are the expectations in terms of context from ALSA when running the callbacks, and how can we fix it?
To me having something in the trigger that needs waiting for is the bit that feels the most awkward fit here, trigger is supposed to run very quickly.
Indeed, other DRM devices seem to send the infoframes as part of hw_params, and it solves our issue there too. I'll send a patch
Thanks for the suggestion! Maxime
participants (4)
-
Jaroslav Kysela
-
Mark Brown
-
Maxime Ripard
-
Takashi Iwai