[Whipping the old thread again, as I'm finally catching up on the backlog after vacation]
On Fri, 16 Dec 2016 17:40:54 +0100, Greg Kroah-Hartman wrote:
On Thu, Dec 15, 2016 at 12:32:08PM +0100, Takashi Iwai wrote:
On Wed, 14 Dec 2016 22:00:50 +0100, Imre Deak wrote:
Hi,
I got the trace below while trying to unload (unbind) snd_hda_intel, while it's still loading the HDMI codec driver. IIUC what happens is:

Task1:
  modprobe snd_hda_intel
    schedule(azx_probe_work)

Task2:
  unbind snd_hda_intel via sysfs
    device_release_driver()
      device_lock(snd_hda_intel)
      azx_remove()
        cancel_work_sync(azx_probe_work)

Task3:
  azx_probe_work()
    request_module(snd-hda-codec-hdmi)
      hdmi_driver_init()
        __driver_attach()
          device_lock(snd_hda_intel)

Deadlock, since azx_probe_work() will never finish and the snd_hda_intel device lock will never get released.
This is indeed nasty. The deadlock happens when the driver core takes the parent's device lock.

static int __driver_attach(struct device *dev, void *data)
{
	....
	if (dev->parent)	/* Needed for USB */
		device_lock(dev->parent);
	device_lock(dev);
	if (!dev->driver)
		driver_probe_device(drv, dev);

I vaguely remember some other issue caused by the device_lock of the parent device. And I guess a similar deadlock may happen not only with the HD-audio driver but in general with every driver using async probe.
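
To make the lock ordering concrete, here is a minimal userspace sketch, assuming pthread mutexes stand in for the device locks. The names toy_device, toy_driver_attach(), toy_probe_work() and attach_blocked_during_remove() are made up for illustration and are not kernel APIs. The attach path uses trylock so that the sketch reports the conflict instead of hanging the way the real code does.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: a lock and a parent link. */
struct toy_device {
	pthread_mutex_t lock;
	struct toy_device *parent;
};

/*
 * Mimics the locking order of __driver_attach(): parent first, then the
 * device itself. trylock is used on the parent so the sketch returns
 * EBUSY where the real code would block forever.
 */
static int toy_driver_attach(struct toy_device *dev)
{
	int err;

	if (dev->parent) {
		err = pthread_mutex_trylock(&dev->parent->lock);
		if (err)
			return err;
	}
	err = pthread_mutex_lock(&dev->lock);
	assert(err == 0);
	/* the driver's probe() callback would run here */
	pthread_mutex_unlock(&dev->lock);
	if (dev->parent)
		pthread_mutex_unlock(&dev->parent->lock);
	return 0;
}

struct toy_work {
	struct toy_device *codec;
	int result;
};

/* Stand-in for the async probe work that binds the codec driver. */
static void *toy_probe_work(void *arg)
{
	struct toy_work *w = arg;

	w->result = toy_driver_attach(w->codec);
	return NULL;
}

/* Returns 1 if the codec attach was blocked by the held controller lock. */
int attach_blocked_during_remove(void)
{
	struct toy_device controller = { .parent = NULL };
	struct toy_device codec = { .parent = &controller };
	struct toy_work w = { &codec, 0 };
	pthread_t worker;
	int err;

	pthread_mutex_init(&controller.lock, NULL);
	pthread_mutex_init(&codec.lock, NULL);

	/* device_release_driver() takes the controller lock before remove() */
	pthread_mutex_lock(&controller.lock);
	err = pthread_create(&worker, NULL, toy_probe_work, &w);
	assert(err == 0);
	/* joining is safe here only because the worker uses trylock */
	pthread_join(worker, NULL);
	pthread_mutex_unlock(&controller.lock);

	pthread_mutex_destroy(&codec.lock);
	pthread_mutex_destroy(&controller.lock);
	return w.result == EBUSY;
}
```

With a blocking lock in place of the trylock, the worker would wait for the controller lock while the remover waits for the worker, which is the reported hang.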
Greg, any good way to avoid such a deadlock? Can we make the parent device lock conditional somehow?
Ick, messy. I don't want to make the parent lock conditional, as it's needed. Shouldn't the cancel_work_sync() prevent the request_module() from running? Seems like you need to serialize your probe_work somehow...
The situation is a bit complex. The work itself was kicked off by the controller driver's probe(), in order to make the codec binding asynchronous. And we can't serialize inside remove() because it is already called with the device lock held.
I guess a workaround for the time being would be just to unlock the device temporarily during this cancel_work_sync(). Since it's in remove() and the device parent's lock is always taken, the race against another binding should be suppressed even if we temporarily unlock the device lock there.
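
The unlock/wait/relock idea can be sketched in userspace as follows, again assuming a pthread mutex stands in for the controller's device lock. pthread_join() plays the role of cancel_work_sync() here, although unlike cancel_work_sync() it cannot cancel work that has not started yet. All names (controller_lock, toy_probe_work(), toy_remove(), remove_with_temporary_unlock()) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the controller's device lock. */
static pthread_mutex_t controller_lock = PTHREAD_MUTEX_INITIALIZER;
static int probe_done;

/*
 * Stand-in for azx_probe_work(): binding the codec needs the controller
 * lock, because the controller is the codec's parent device.
 */
static void *toy_probe_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&controller_lock);
	probe_done = 1;
	pthread_mutex_unlock(&controller_lock);
	return NULL;
}

/* Stand-in for azx_remove(); entered with controller_lock held. */
static void toy_remove(pthread_t worker)
{
	/* drop the lock so the pending work can take it and finish */
	pthread_mutex_unlock(&controller_lock);
	pthread_join(worker, NULL);	/* plays the role of cancel_work_sync() */
	/* retake the lock: the caller expects it to be held on return */
	pthread_mutex_lock(&controller_lock);
}

/* Returns 1 if the work finished and remove() returned with the lock held. */
int remove_with_temporary_unlock(void)
{
	pthread_t worker;
	int err, done;

	probe_done = 0;
	/* device_release_driver() takes the lock before calling remove() */
	pthread_mutex_lock(&controller_lock);
	err = pthread_create(&worker, NULL, toy_probe_work, NULL);
	assert(err == 0);

	toy_remove(worker);

	done = probe_done;	/* read while the lock is held again */
	pthread_mutex_unlock(&controller_lock);
	return done;
}
```

Without the unlock/lock pair in toy_remove(), the join would never return, since the worker could not take controller_lock.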
Below is the untested patch. It's a pity that the first patch I wrote this year is something like this... ;)
thanks,
Takashi
-- 8< --
From: Takashi Iwai <tiwai@suse.de>
Subject: [PATCH] ALSA: hda - Fix deadlock of controller device lock at unbinding

Imre Deak reported a deadlock of the HD-audio driver at unbinding while it's still probing. Since we probe the codecs asynchronously in a work, the codec driver probe may still be kicked off while the controller itself is being unbound. And since azx_remove() tries to process all pending tasks via cancel_work_sync() to fix the other races (see commit [0b8c82190c12: ALSA: hda - Cancel probe work instead of flush at remove]), we may now hit a bizarre deadlock:
Unbind snd_hda_intel via sysfs: device_release_driver() -> device_lock(snd_hda_intel) -> azx_remove() -> cancel_work_sync(azx_probe_work)
azx_probe_work(): codec driver probe() -> __driver_attach() -> device_lock(snd_hda_intel)
This deadlock is caused by the fact that both device_release_driver() and driver_probe_device() take both the device and its parent locks at the same time. The codec device sets the controller device as its parent, and this lock is taken before the probe() callback is called, while the controller remove() callback gets called also with the same lock.
In this patch, as an ugly workaround, we unlock the controller device temporarily during the cancel_work_sync() call. The race against another bind call should still be suppressed by the parent's device lock.
Reported-by: Imre Deak <imre.deak@intel.com>
Fixes: 0b8c82190c12 ("ALSA: hda - Cancel probe work instead of flush at remove")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 sound/pci/hda/hda_intel.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index c64d986009a9..2587c197e353 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -2155,7 +2155,20 @@ static void azx_remove(struct pci_dev *pci)
 		/* cancel the pending probing work */
 		chip = card->private_data;
 		hda = container_of(chip, struct hda_intel, chip);
+		/* FIXME: below is an ugly workaround.
+		 * Both device_release_driver() and driver_probe_device()
+		 * take *both* the device's and its parent's lock before
+		 * calling the remove() and probe() callbacks.  The codec
+		 * probe takes the locks of both the codec itself and its
+		 * parent, i.e. the PCI controller dev.  Meanwhile, when
+		 * the PCI controller is unbound, it takes its lock, too
+		 * ==> ouch, a deadlock!
+		 * As a workaround, we unlock temporarily here the controller
+		 * device during cancel_work_sync() call.
+		 */
+		device_unlock(&pci->dev);
 		cancel_work_sync(&hda->probe_work);
+		device_lock(&pci->dev);
 
 		snd_card_free(card);
 	}