On Thu, 23 May 2019 10:03:03 +0200, Yang, Libin wrote:
Please let me describe the issue here.
The test case is:
- Unload module with script "sudo ./sof_remove.sh" ,
- reload module with script "sudo ./sof_insert.sh"
After several rounds of removing and inserting kernel modules, system will complain like below: "BUG: unable to handle kernel paging request at 000000292a282031"
Did you try some kernel debug options? It might show what went wrong.
No, I haven't. I'm not sure which options I can use for this case. Could you please give me some suggestions?
You can enable CONFIG_DEBUG_DEVRES and adjust the devres.log option
for
showing each devres allocation and removal. And I'd try CONFIG_DEBUG_SLAB and CONFIG_DEBUG_KMEMLEAK or whatever
interesting in
CONFIG_DEBUG section, too.
Thanks for your suggestion. After more than 1 week debug, I think maybe I have root caused this issue from the devres.log message.
Below is my finding.
- When initialing the codecs, snd_hdac_ext_bus_device_init() will be called,
and it will set hdev->dev.release = default_release. However, for analog codec (not hdac_hdmi codec), hdac_hda_codec_probe() will be called later. And it will call snd_hda_codec_device_new(), which will reset codec->code.dev.release = snd_hda_codec_dev_release; This means hdac_hdmi's hdev dev release is default_release() defined in hdac_ext_bus.c, and other hda codec's hdev dev release is snd_hda_codec_dev_release().
Both default_release() and snd_hda_codec_dev_release() will call kfree() to free the hdac_device (or its container) in the current code.
- When we run rmmod sof_pci_dev, it will free the sdev. If we use Struct
hdac_device *hdev = devm_kzalloc(sdev->dev...). This means hdev will also be freed automatically.
In the removal, snd_hdac_ext_bus_device_remove() will be called to remove the hdev (in this function it is struct hdac_device *codec. The name is not aligned in different places). However for hdac_hdmi, the hdev->dev is used by other code. So calling device.release() (the function default_release()) will be postponed. After after sdev is freed, the device.release() will be called. But for devm_xxx, hdev will also be freed when sdev is freed. This means hdev.dev after sdev is freed is invalid now as hdev has already freed. It will access invalid memory. This will cause the bug.
So I think we should not use devm_xxx, and let's free the hdev manually.
At the end of this topic, I still found 2 suspicious code in the current code.
- in sound/soc/intel/skylate/skl.c
it calls hdev = devm_kazlloc() or hda_codec = devm_kzalloc(). As we will call kfree() in the current code, should we replace it with kzalloc()? Maybe we need cavs drivers owner's help on it.
- in snd_hdac_ext_bus_device_remove()
It will call snd_hdac_device_unregister() to unregister the hdac_devices and put_device(&codec->dev) to release the dev. For analog codec, snd_hdac_device_unregister() will free the codec->dev's kobject. And snd_hda_codec_dev_release() will be called to free the hdac_device. So it is invalid to call put_device(&codec->dev). If you print refcound_read(&(codec->dev.kobj.kref.refcount)) for analog codec before put_device(), you will find the refcount has already been 0.
It seems Ranjani's " ASoC: hda: increment codec device refcount when it is added to the card" patch can fix the second issue.
Regards, Libin