[PATCH] ALSA: firewire-lib: restore process context workqueue to prevent deadlock
Commit b5b519965c4c ("ALSA: firewire-lib: operate for period elapse event in process context") removed the process context workqueue from amdtp_domain_stream_pcm_pointer() and update_pcm_pointers() to remove its overhead.
With RME Fireface 800, this lead to a regression since Kernels 5.14.0, causing a deadlock with eventual system freeze under ALSA operation:
? tasklet_unlock_spin_wait </NMI> <TASK> ohci_flush_iso_completions firewire_ohci amdtp_domain_stream_pcm_pointer snd_firewire_lib snd_pcm_update_hw_ptr0 snd_pcm snd_pcm_status64 snd_pcm
? native_queued_spin_lock_slowpath </NMI> <IRQ> _raw_spin_lock_irqsave snd_pcm_period_elapsed snd_pcm process_rx_packets snd_firewire_lib irq_target_callback snd_firewire_lib handle_it_packet firewire_ohci context_tasklet firewire_ohci
Restore the work queue to prevent deadlock between ALSA substream process context spin_lock of snd_pcm_stream_lock_irq() in snd_pcm_status64() and OHCI 1394 IT softIRQ context spin_lock of snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed().
to reproduce the issue: direct ALSA playback to the device: mpv --audio-device=alsa/sysdefault:CARD=Fireface800 Spor-Ignition.flac Time to occurrence: 2s to 30m Likelihood increased by: - high CPU load stress --cpu $(nproc) - switching between applications via workspaces tested with i915 in Xfce PulsaAudio / PipeWire conceal the issue as they run PCM substream without period wakeup mode, issuing less hardIRQs.
Closes: https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76... Fixes: 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context") Signed-off-by: Edmund Raile edmund.raile@proton.me --- This is the follow-up patch to the 5.14.0 regression I reported: https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76... ("[REGRESSION] ALSA: firewire-lib: snd_pcm_period_elapsed deadlock with Fireface 800")
Takashi Sakamoto explained the issue in his response to the regression: A. In the process context * (lock A) Acquiring spin_lock by snd_pcm_stream_lock_irq() in snd_pcm_status64() * (lock B) Then attempt to enter tasklet
B. In the softIRQ context * (lock B) Enter tasklet * (lock A) Attempt to acquire spin_lock by snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed()
This leads me to believe this isn't just an issue limited to the RME Fireface
driver (snd_fireface), though I can not test the other devices. sound/firewire/amdtp-stream.c | 32 +++++++++++++++++++++----------- sound/firewire/amdtp-stream.h | 1 + 2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/sound/firewire/amdtp-stream.c b/sound/firewire/amdtp-stream.c index d35d0a420ee0..77b99a2117f4 100644 --- a/sound/firewire/amdtp-stream.c +++ b/sound/firewire/amdtp-stream.c @@ -77,6 +77,8 @@ // overrun. Actual device can skip more, then this module stops the packet streaming. #define IR_JUMBO_PAYLOAD_MAX_SKIP_CYCLES 5
+static void pcm_period_work(struct work_struct *work); + /** * amdtp_stream_init - initialize an AMDTP stream structure * @s: the AMDTP stream to initialize @@ -105,6 +107,7 @@ int amdtp_stream_init(struct amdtp_stream *s, struct fw_unit *unit, s->flags = flags; s->context = ERR_PTR(-1); mutex_init(&s->mutex); + INIT_WORK(&s->period_work, pcm_period_work); s->packet_index = 0;
init_waitqueue_head(&s->ready_wait); @@ -347,6 +350,7 @@ EXPORT_SYMBOL(amdtp_stream_get_max_payload); */ void amdtp_stream_pcm_prepare(struct amdtp_stream *s) { + cancel_work_sync(&s->period_work); s->pcm_buffer_pointer = 0; s->pcm_period_pointer = 0; } @@ -611,19 +615,21 @@ static void update_pcm_pointers(struct amdtp_stream *s, // The program in user process should periodically check the status of intermediate // buffer associated to PCM substream to process PCM frames in the buffer, instead // of receiving notification of period elapsed by poll wait. - if (!pcm->runtime->no_period_wakeup) { - if (in_softirq()) { - // In software IRQ context for 1394 OHCI. - snd_pcm_period_elapsed(pcm); - } else { - // In process context of ALSA PCM application under acquired lock of - // PCM substream. - snd_pcm_period_elapsed_under_stream_lock(pcm); - } - } + if (!pcm->runtime->no_period_wakeup) + queue_work(system_highpri_wq, &s->period_work); } }
+static void pcm_period_work(struct work_struct *work) +{ + struct amdtp_stream *s = container_of(work, struct amdtp_stream, + period_work); + struct snd_pcm_substream *pcm = READ_ONCE(s->pcm); + + if (pcm) + snd_pcm_period_elapsed(pcm); +} + static int queue_packet(struct amdtp_stream *s, struct fw_iso_packet *params, bool sched_irq) { @@ -1854,7 +1860,10 @@ unsigned long amdtp_domain_stream_pcm_pointer(struct amdtp_domain *d, if (irq_target && amdtp_stream_running(irq_target)) { // In software IRQ context, the call causes dead-lock to disable the tasklet // synchronously. - if (!in_softirq()) + // workqueue serves to prevent deadlock between process context spinlock + // of snd_pcm_stream_lock_irq() in snd_pcm_status64() and softIRQ spinlock + // of snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed() + if ((!in_softirq()) && (current_work() != &s->period_work)) fw_iso_context_flush_completions(irq_target->context); }
@@ -1910,6 +1919,7 @@ static void amdtp_stream_stop(struct amdtp_stream *s) return; }
+ cancel_work_sync(&s->period_work); fw_iso_context_stop(s->context); fw_iso_context_destroy(s->context); s->context = ERR_PTR(-1); diff --git a/sound/firewire/amdtp-stream.h b/sound/firewire/amdtp-stream.h index a1ed2e80f91a..775db3fc4959 100644 --- a/sound/firewire/amdtp-stream.h +++ b/sound/firewire/amdtp-stream.h @@ -191,6 +191,7 @@ struct amdtp_stream {
/* For a PCM substream processing. */ struct snd_pcm_substream *pcm; + struct work_struct period_work; snd_pcm_uframes_t pcm_buffer_pointer; unsigned int pcm_period_pointer; unsigned int pcm_frame_multiplier;
Hi,
Thank you for your posting the patch, however I have some nitpickings (and some requests to you).
On Thu, Jul 18, 2024 at 11:56:54AM +0000, Edmund Raile wrote:
Commit b5b519965c4c ("ALSA: firewire-lib: operate for period elapse event in process context") removed the process context workqueue from amdtp_domain_stream_pcm_pointer() and update_pcm_pointers() to remove its overhead.
With RME Fireface 800, this lead to a regression since Kernels 5.14.0, causing a deadlock with eventual system freeze under ALSA operation:
? tasklet_unlock_spin_wait
</NMI> <TASK> ohci_flush_iso_completions firewire_ohci amdtp_domain_stream_pcm_pointer snd_firewire_lib snd_pcm_update_hw_ptr0 snd_pcm snd_pcm_status64 snd_pcm
? native_queued_spin_lock_slowpath
</NMI> <IRQ> _raw_spin_lock_irqsave snd_pcm_period_elapsed snd_pcm process_rx_packets snd_firewire_lib irq_target_callback snd_firewire_lib handle_it_packet firewire_ohci context_tasklet firewire_ohci
Restore the work queue to prevent deadlock between ALSA substream process context spin_lock of snd_pcm_stream_lock_irq() in snd_pcm_status64() and OHCI 1394 IT softIRQ context spin_lock of snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed().
to reproduce the issue: direct ALSA playback to the device: mpv --audio-device=alsa/sysdefault:CARD=Fireface800 Spor-Ignition.flac Time to occurrence: 2s to 30m Likelihood increased by:
- high CPU load stress --cpu $(nproc)
- switching between applications via workspaces tested with i915 in Xfce
PulsaAudio / PipeWire conceal the issue as they run PCM substream without period wakeup mode, issuing less hardIRQs.
Closes: https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76... Fixes: 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context") Signed-off-by: Edmund Raile edmund.raile@proton.me
This is the follow-up patch to the 5.14.0 regression I reported: https://lore.kernel.org/regressions/kwryofzdmjvzkuw6j3clftsxmoolynljztxqwg76... ("[REGRESSION] ALSA: firewire-lib: snd_pcm_period_elapsed deadlock with Fireface 800")
Takashi Sakamoto explained the issue in his response to the regression: A. In the process context * (lock A) Acquiring spin_lock by snd_pcm_stream_lock_irq() in snd_pcm_status64() * (lock B) Then attempt to enter tasklet
B. In the softIRQ context * (lock B) Enter tasklet * (lock A) Attempt to acquire spin_lock by snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed()
This leads me to believe this isn't just an issue limited to the RME Fireface
We are in the merge window to Linux kernel 6.11[1]. I prefer reviewing a small and trivial patches in the weeks, thus I would like to postpone applying this kind of patches after releasing -rc1 even if they look good.
In Linux kernel development, we have the process to apply patches into released kernels to fix bugs and regressions. As of today, the developers pay some of their efforts to maintain the following kernels[2]:
* 6.10 * 6.9.10 * 6.6.41 * 6.1.100 * 5.15.163 * 5.10.222 * 5.4.280 * 4.19.318
It is cleared that the issued changes were applied to v5.14 kernel, thus we should lead the developers to find the posted patches and apply them to future releases of each version. The process[3] have come into existence enough before the regression report procedure[4]. I would like you to read some documentation about the process and add more care for stable kernel maintainers.
Well, the issued commits are (older at first):
* 7ba5ca32fe6 * b5b519965c4
As long as I can see, these commits can be reverted per each, with a slight handy-changes. In the case, it is preferable to make a patchset including these two revert patches. I would like to request it to you, instead of the all-in-one patch, so that developers easily find the issued commits (and work to apply these patches into kernels maintained publicly/locally).
At last, I prefer that the whole patch comment is written by the posters, instead of referring to comments by the others. I know that the description about AB/BA deadlock is a bit hard to write, but it is enough and satisfied for you to write what you understand about the issue. I'd accept it.
[1] https://lore.kernel.org/lkml/CAHk-=wjV_O2g_K19McjGKrxFxMFDqex+fyGcKc3uac1ft_... [2] https://kernel.org/ [3] https://docs.kernel.org/process/stable-kernel-rules.html [4] https://docs.kernel.org/process/handling-regressions.html
Thanks
Takashi Sakamoto
participants (2)
-
Edmund Raile
-
Takashi Sakamoto