Hi Pierre,
Hi Libin,
Please let me describe the issue here.
The test case is:
- Unload module with script "sudo ./sof_remove.sh" ,
- reload module with script "sudo ./sof_insert.sh"
After several rounds of removing and inserting kernel modules, system will complain like below: "BUG: unable to handle kernel paging request at 000000292a282031"
Did you try some kernel debug options? It might show what went wrong.
No, I haven't. I'm not sure which options I can use for this case. Could you please give me some suggestions?
You can enable CONFIG_DEBUG_DEVRES and adjust the devres.log option for showing each devres allocation and removal. And I'd try CONFIG_DEBUG_SLAB and CONFIG_DEBUG_KMEMLEAK or whatever
interesting
in CONFIG_DEBUG section, too.
Thanks for your suggestion. After more than 1 week debug, I think maybe I have root caused this issue from the devres.log message.
Below is my finding.
- When initialing the codecs, snd_hdac_ext_bus_device_init() will be
called, and it will set hdev->dev.release = default_release. However, for analog codec (not hdac_hdmi codec), hdac_hda_codec_probe() will be called later. And it will call snd_hda_codec_device_new(), which will reset codec->code.dev.release = snd_hda_codec_dev_release; This means hdac_hdmi's hdev dev release is default_release() defined in hdac_ext_bus.c, and other hda codec's hdev dev release is snd_hda_codec_dev_release().
Both default_release() and snd_hda_codec_dev_release() will call kfree() to free the hdac_device (or its container) in the current code.
- When we run rmmod sof_pci_dev, it will free the sdev. If we use
Struct hdac_device *hdev = devm_kzalloc(sdev->dev...). This means hdev will also be freed automatically.
In the removal, snd_hdac_ext_bus_device_remove() will be called to remove the hdev (in this function it is struct hdac_device *codec. The name is not aligned in different places). However for hdac_hdmi, the hdev->dev is used by other code.
what other code? Can you elaborate on why the release is delayed?
From the dmesg, it is the device_link used for hdac_hdmi that causes the release delay. Device_link will increase the refcount.
So calling device.release() (the function default_release()) will be postponed. After after sdev is freed, the device.release() will be called. But for devm_xxx, hdev will also be freed when sdev is freed. This means hdev.dev after sdev is freed is invalid now as hdev has already freed. It will access invalid memory. This will cause the bug.
This is very hard to follow. 4 lines above you wrote the release is postponed but the way you describe is looks completely sequential.
For hdac_hdmi, as it is used by other code (device_link), the device release will not be called immediately. After sdev and hdev is freed, the hdev device is put by device_link. Then hdev device can be released. But hdev has already be freed. This is wrong.
So I think we should not use devm_xxx, and let's free the hdev manually.
At the end of this topic, I still found 2 suspicious code in the current code.
- in sound/soc/intel/skylate/skl.c
it calls hdev = devm_kazlloc() or hda_codec = devm_kzalloc(). As we will call kfree() in the current code, should we replace it with kzalloc()? Maybe we need cavs drivers owner's help on it.
maybe you should send a diff suggestion to help everyone understand the changes you are referring to?
Please see my inline patch:
diff --git a/sound/soc/intel/skylake/skl.c b/sound/soc/intel/skylake/skl.c index f864f7b..12af99a 100644 --- a/sound/soc/intel/skylake/skl.c +++ b/sound/soc/intel/skylake/skl.c @@ -697,8 +697,7 @@ static int probe_codec(struct hdac_bus *bus, int addr) dev_dbg(bus->dev, "codec #%d probed OK: %x\n", addr, res);
#if IS_ENABLED(CONFIG_SND_SOC_INTEL_SKYLAKE_HDAUDIO_CODEC) - hda_codec = devm_kzalloc(&skl->pci->dev, sizeof(*hda_codec), - GFP_KERNEL); + hda_codec = kzalloc(sizeof(*hda_codec), GFP_KERNEL); if (!hda_codec) return -ENOMEM;
@@ -716,7 +715,7 @@ static int probe_codec(struct hdac_bus *bus, int addr) } return 0; #else - hdev = devm_kzalloc(&skl->pci->dev, sizeof(*hdev), GFP_KERNEL); + hdev = kzalloc(sizeof(*hdev), GFP_KERNEL); if (!hdev) return -ENOMEM;
- in snd_hdac_ext_bus_device_remove() It will call
snd_hdac_device_unregister() to unregister the hdac_devices and put_device(&codec->dev) to release the dev. For analog codec, snd_hdac_device_unregister() will free the codec->dev's kobject. And snd_hda_codec_dev_release() will be called to free the hdac_device. So it is invalid to call put_device(&codec->dev). If you print refcound_read(&(codec->dev.kobj.kref.refcount)) for analog codec before put_device(), you will find the refcount has already been 0.
Isn't it a different problem though? Does this cause a freeze or is this just a bad refcount?
Yes, it is a totally different problem. And today morning, I found Ranjani has a patch " ASoC: hda: increment codec device refcount when it is added to the card ", which should fix this bug.
Regards, Libin