[alsa-devel] Thread spinning in kernel snd_pcm_link()/snd_pcm_unlink()
I'm trying to address a bug where we end up with a thread spinning and consuming an entire cpu. The issue seems to be this code in sound/core/pcm_native.c:
/* Writer in rwsem may block readers even during its waiting in queue,
 * and this may lead to a deadlock when the code path takes read sem
 * twice (e.g. one in snd_pcm_action_nonatomic() and another in
 * snd_pcm_stream_lock()).  As a (suboptimal) workaround, let writer to
 * spin until it gets the lock.
 */
static inline void down_write_nonblock(struct rw_semaphore *lock)
{
	while (!down_write_trylock(lock))
		cond_resched();
}
The original commit for this is 67ec1072b053c15564e6090ab30127895dc77a89
What we're suspecting is that a normal thread (SCHED_OTHER) has a reader lock and a real-time thread using SCHED_RR or SCHED_FIFO is trying to take the writer lock. If both threads are pinned to the same CPU for some reason then the reader thread will never get scheduled (because the real-time writer thread is still runnable), and we will never make progress.
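To make the suspected scenario concrete, here is a toy userspace model (not kernel code). It assumes a single CPU and a strict-priority scheduler in which the highest-priority runnable task always runs and equal priorities take turns. All names are made up for illustration, and real scheduler details such as time slices or RT throttling are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: one CPU, one "writer" that spins on a trylock and one "reader"
 * that needs reader_work ticks of CPU time before it drops the read lock.
 * A larger prio value means higher priority.
 *
 * Returns the tick at which the writer gets the lock, or -1 if it never
 * does within max_ticks.
 */
int toy_ticks_until_write_lock(int writer_prio, int reader_prio,
                               int reader_work, int max_ticks)
{
    bool writer_ran_last = false;

    assert(reader_work >= 0 && max_ticks >= 0);

    for (int tick = 1; tick <= max_ticks; tick++) {
        bool reader_runnable = reader_work > 0;
        bool run_writer;

        if (!reader_runnable)
            run_writer = true;                  /* only the writer is left */
        else if (writer_prio != reader_prio)
            run_writer = writer_prio > reader_prio;
        else
            run_writer = !writer_ran_last;      /* equal priority: take turns */

        if (run_writer) {
            if (reader_work == 0)
                return tick;                    /* trylock succeeds */
            /* trylock failed; the writer stays runnable and will retry */
        } else {
            reader_work--;                      /* reader moves towards unlock */
        }
        writer_ran_last = run_writer;
    }
    return -1;
}
```

Under these assumptions, toy_ticks_until_write_lock(50, 0, 3, 1000) returns -1 because the reader never gets a tick, while giving both tasks the same priority lets the reader finish and the writer take the lock.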
Does this sound right? What can we do to fix this?
Thanks,
Rob.
On Fri, 28 Sep 2018 18:23:24 +0200, Rob Duncan wrote:
> I'm trying to address a bug where we end up with a thread spinning and consuming an entire cpu. The issue seems to be this code in sound/core/pcm_native.c:
>
> /* Writer in rwsem may block readers even during its waiting in queue,
>  * and this may lead to a deadlock when the code path takes read sem
>  * twice (e.g. one in snd_pcm_action_nonatomic() and another in
>  * snd_pcm_stream_lock()).  As a (suboptimal) workaround, let writer to
>  * spin until it gets the lock.
>  */
> static inline void down_write_nonblock(struct rw_semaphore *lock)
> {
> 	while (!down_write_trylock(lock))
> 		cond_resched();
> }
>
> The original commit for this is 67ec1072b053c15564e6090ab30127895dc77a89
>
> What we're suspecting is that a normal thread (SCHED_OTHER) has a reader lock and a real-time thread using SCHED_RR or SCHED_FIFO is trying to take the writer lock. If both threads are pinned to the same CPU for some reason then the reader thread will never get scheduled (because the real-time writer thread is still runnable), and we will never make progress.
>
> Does this sound right? What can we do to fix this?
I'm not sure whether that's the case. Do you mean that one thread gets stuck at pcm_release_private() which calls snd_pcm_unlink()? Or do you really use the PCM linkage?
In the former case, we could relax this with an optimization like the patch below (totally untested). I don't think it introduces a racy access, but it needs double-checking afterward.
thanks,
Takashi
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -2369,7 +2369,8 @@ int snd_pcm_hw_constraints_complete(struct snd_pcm_substream *substream)
 
 static void pcm_release_private(struct snd_pcm_substream *substream)
 {
-	snd_pcm_unlink(substream);
+	if (snd_pcm_stream_linked(substream))
+		snd_pcm_unlink(substream);
 }
 
 void snd_pcm_release_substream(struct snd_pcm_substream *substream)
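The idea behind the patch can be sketched in userspace as follows. This is only a model: the toy_* types and functions are hypothetical stand-ins, and the "linked" test (the group pointer differing from the stream's own group) is an assumption about how the kernel helper behaves, not a copy of it. A counter stands in for the spinning write-lock path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct toy_group {
    int members;
};

struct toy_substream {
    struct toy_group self_group;    /* group used while the stream is unlinked */
    struct toy_group *group;        /* points at self_group unless linked */
};

int toy_write_lock_attempts;        /* counts entries into the slow path */

void toy_substream_init(struct toy_substream *s)
{
    s->self_group.members = 1;
    s->group = &s->self_group;
}

bool toy_stream_linked(const struct toy_substream *s)
{
    return s->group != &s->self_group;
}

/* Slow path: the real code would have to obtain the write lock here. */
void toy_unlink(struct toy_substream *s)
{
    assert(s->group != NULL);
    toy_write_lock_attempts++;
    if (toy_stream_linked(s)) {
        s->group->members--;
        s->group = &s->self_group;
    }
}

/* Patched release path: skip the slow path for streams that were never linked. */
void toy_release_private(struct toy_substream *s)
{
    if (toy_stream_linked(s))
        toy_unlink(s);
}
```

In this model, releasing a stream that was never linked leaves toy_write_lock_attempts at zero, so the spinning path is only reached by streams that actually use linkage. Whether the unlocked check is safe against concurrent linking is exactly the race question raised above and is not answered by this sketch.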
Hi Takashi,
Thanks for taking a look at this.
> I'm not sure whether that's the case. Do you mean that one thread gets stuck at pcm_release_private() which calls snd_pcm_unlink()? Or do you really use the PCM linkage?
We're not explicitly using the link/unlink APIs, so I think it must be pcm_release_private().
I'll try out your suggestion over the next couple of days. In the meantime we've avoided the issue by arranging for the realtime threads to have the same priority (which I think they should have anyway).
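For reference, one way an application might give its real-time threads a common priority is to build the thread attributes explicitly. This is a minimal sketch using standard POSIX calls; the helper name rt_attr_init is made up here, and the policy and priority values are whatever the application chooses.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/*
 * Prepare attributes for a thread with an explicit scheduling policy and
 * priority. Returns 0 on success or an errno-style error code.
 */
int rt_attr_init(pthread_attr_t *attr, int policy, int priority)
{
    struct sched_param param = { .sched_priority = priority };
    int err;

    assert(attr != NULL);

    err = pthread_attr_init(attr);
    if (err)
        return err;

    /* Without this, the new thread inherits the creator's scheduling. */
    err = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (!err)
        err = pthread_attr_setschedpolicy(attr, policy);
    if (!err)
        err = pthread_attr_setschedparam(attr, &param);
    if (err)
        pthread_attr_destroy(attr);
    return err;
}
```

Calling rt_attr_init() with the same policy and priority for each thread and passing the result to pthread_create() gives them equal priority. Creating real-time threads normally requires suitable privileges, otherwise pthread_create() fails with EPERM.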
Rob.
At 08:14 on Tue, Oct 02 2018, Takashi wrote:
> I'm not sure whether that's the case. Do you mean that one thread gets stuck at pcm_release_private() which calls snd_pcm_unlink()? Or do you really use the PCM linkage?
>
> In the former case, we could relax this with an optimization like the patch below (totally untested). I don't think it introduces a racy access, but it needs double-checking afterward.
>
> thanks,
> Takashi
>
> --- a/sound/core/pcm_native.c
> +++ b/sound/core/pcm_native.c
> @@ -2369,7 +2369,8 @@ int snd_pcm_hw_constraints_complete(struct snd_pcm_substream *substream)
>
>  static void pcm_release_private(struct snd_pcm_substream *substream)
>  {
> -	snd_pcm_unlink(substream);
> +	if (snd_pcm_stream_linked(substream))
> +		snd_pcm_unlink(substream);
>  }
>
>  void snd_pcm_release_substream(struct snd_pcm_substream *substream)
participants (2): Rob Duncan, Takashi Iwai