Conceptual bug on SoundWire probe/remove?
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port));
slave->ops->interrupt_callback(slave, &slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
We probably need to avoid such references, or have a clean mechanism to unbind, e.g. with all commands and interrupts stopped while the codec driver .remove routine is handled.
Initial error reports at https://github.com/thesofproject/linux/issues/3531
Suggestions and comments welcome, thanks! -Pierre
On Wed, Mar 23, 2022 at 02:45:59PM -0500, Pierre-Louis Bossart wrote:
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
We probably need to avoid such references, or have a clean mechanism to unbind, e.g. with all commands and interrupts stopped while the codec driver .remove routine is handled.
Your analysis seems pretty much spot on - you'll need locking or other measures to make sure there are no live callbacks from the bus while the device is being removed. It's a fairly standard problem unfortunately.
On 23/03/2022 19:45, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
may be add try_module_get(dev->driver->owner) here, which should prevent the module from unloading in the first place.
--srini
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port));
slave->ops->interrupt_callback(slave, &slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
We probably need to avoid such references, or have a clean mechanism to unbind, e.g. with all commands and interrupts stopped while the codec driver .remove routine is handled.
Initial error reports at https://github.com/thesofproject/linux/issues/3531
Suggestions and comments welcome, thanks! -Pierre
On Thu, 24 Mar 2022 11:12:34 +0100, Srinivas Kandagatla wrote:
On 23/03/2022 19:45, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
may be add try_module_get(dev->driver->owner) here, which should prevent the module from unloading in the first place.
The module refcount helps only partially, unfortunately. The unbindig via sysfs is still allowed and it hits the very same problem.
Takashi
On Wed, Mar 23, 2022 at 02:45:59PM -0500, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
That is wrong and should never happen as you lost all reference counting. Please don't do that.
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port)); slave->ops->interrupt_callback(slave,
&slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
Just properly reference count things and this should be ok. You can NEVER just save off a pointer to a random thing that you do not own without increasing the reference count, otherwise bad things will happen. It's always been this way.
thanks,
greg k-h
On 3/24/22 06:15, Greg Kroah-Hartman wrote:
On Wed, Mar 23, 2022 at 02:45:59PM -0500, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
That is wrong and should never happen as you lost all reference counting. Please don't do that.
ok, so I think we all agree on the issue. It's not new code, it's been there since December 2017 and 4.16
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port)); slave->ops->interrupt_callback(slave,
&slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
Just properly reference count things and this should be ok. You can NEVER just save off a pointer to a random thing that you do not own without increasing the reference count, otherwise bad things will happen. It's always been this way.
but I am not following the reference count recommendation in that specific case. If we increase a reference count in the probe, wouldn't that prevent the unbind/remove?
On Fri, Mar 25, 2022 at 10:51:51AM -0500, Pierre-Louis Bossart wrote:
On 3/24/22 06:15, Greg Kroah-Hartman wrote:
On Wed, Mar 23, 2022 at 02:45:59PM -0500, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
That is wrong and should never happen as you lost all reference counting. Please don't do that.
ok, so I think we all agree on the issue. It's not new code, it's been there since December 2017 and 4.16
It's hard to notice that in code review :(
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port)); slave->ops->interrupt_callback(slave,
&slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
Just properly reference count things and this should be ok. You can NEVER just save off a pointer to a random thing that you do not own without increasing the reference count, otherwise bad things will happen. It's always been this way.
but I am not following the reference count recommendation in that specific case. If we increase a reference count in the probe, wouldn't that prevent the unbind/remove?
Depends on what you are increasing the reference count of :)
bind/unbind has nothing to do with the reference count of objects, it only prevents the devices/whatever from being removed from the system (hopefully.)
When doing "give me the driver ops of that driver over there" you have to be VERY careful. It's a very uncommon pattern that I can't think of anyone else doing outside of a bus logic. I don't recommend it, instead grab references to the devices themselves and go through the device and what is bound/unbound to that. Also keep things in sync when you grab those other devices, that can also get messy.
How about just not doing this at all?
On 3/25/22 11:26, Greg Kroah-Hartman wrote:
On Fri, Mar 25, 2022 at 10:51:51AM -0500, Pierre-Louis Bossart wrote:
On 3/24/22 06:15, Greg Kroah-Hartman wrote:
On Wed, Mar 23, 2022 at 02:45:59PM -0500, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
That is wrong and should never happen as you lost all reference counting. Please don't do that.
ok, so I think we all agree on the issue. It's not new code, it's been there since December 2017 and 4.16
It's hard to notice that in code review :(
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port)); slave->ops->interrupt_callback(slave,
&slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
Just properly reference count things and this should be ok. You can NEVER just save off a pointer to a random thing that you do not own without increasing the reference count, otherwise bad things will happen. It's always been this way.
but I am not following the reference count recommendation in that specific case. If we increase a reference count in the probe, wouldn't that prevent the unbind/remove?
Depends on what you are increasing the reference count of :)
bind/unbind has nothing to do with the reference count of objects, it only prevents the devices/whatever from being removed from the system (hopefully.)
When doing "give me the driver ops of that driver over there" you have to be VERY careful. It's a very uncommon pattern that I can't think of anyone else doing outside of a bus logic. I don't recommend it, instead grab references to the devices themselves and go through the device and what is bound/unbound to that. Also keep things in sync when you grab those other devices, that can also get messy.
How about just not doing this at all?
That's what I was exploring, I am not sure at all we need to use this pointer at all.
Maybe we should go back to the problem statement:
The SoundWire codec driver provides a set of callbacks that need to be invoked from bus layer when events are detected (bus reconfiguration, clock about to be stopped, attach/detach and interrupt).
The devices are created upfront, and if there's no driver things will work just fine. When we bind a driver, then these callbacks are detected as not null.
The part that I am not familiar with is how we can prevent a remove/unbind while invoking the callbacks.
Maybe we can indeed increase the refcount on the device, follow the device->driver->ops pointers and reduce the refcount once the callback completes.
On Fri, Mar 25, 2022 at 11:51:52AM -0500, Pierre-Louis Bossart wrote:
On 3/25/22 11:26, Greg Kroah-Hartman wrote:
On Fri, Mar 25, 2022 at 10:51:51AM -0500, Pierre-Louis Bossart wrote:
On 3/24/22 06:15, Greg Kroah-Hartman wrote:
On Wed, Mar 23, 2022 at 02:45:59PM -0500, Pierre-Louis Bossart wrote:
Hi, I could use feedback/guidance on a possible conceptual bug in the SoundWire probe and bus handling.
When we probe a driver, the code does this:
static int sdw_drv_probe(struct device *dev) { struct sdw_slave *slave = dev_to_sdw_dev(dev); struct sdw_driver *drv = drv_to_sdw_driver(dev->driver); const struct sdw_device_id *id; const char *name; int ret;
/* * fw description is mandatory to bind */ if (!dev->fwnode) return -ENODEV;
if (!IS_ENABLED(CONFIG_ACPI) && !dev->of_node) return -ENODEV;
id = sdw_get_device_id(slave, drv); if (!id) return -ENODEV;
slave->ops = drv->ops;
That is wrong and should never happen as you lost all reference counting. Please don't do that.
ok, so I think we all agree on the issue. It's not new code, it's been there since December 2017 and 4.16
It's hard to notice that in code review :(
The last line is the problematic one. If at some point, the user does an rmmod and unbinds the SoundWire codec driver, the .remove will be called and the 'drv' will no longer be valid, but we will still have a reference to drv->ops and use that pointer in the bus code, e.g.
/* Update the Slave driver */ if (slave_notify && slave->ops && slave->ops->interrupt_callback) { slave_intr.sdca_cascade = sdca_cascade; slave_intr.control_port = clear; memcpy(slave_intr.port, &port_status, sizeof(slave_intr.port)); slave->ops->interrupt_callback(slave,
&slave_intr); }
I noodled with a potential fix in https://github.com/thesofproject/linux/pull/3534/commits/82d64fb0fd39b532263...
where I force-reset this slave->ops pointer, but it is likely to be very racy.
Just properly reference count things and this should be ok. You can NEVER just save off a pointer to a random thing that you do not own without increasing the reference count, otherwise bad things will happen. It's always been this way.
but I am not following the reference count recommendation in that specific case. If we increase a reference count in the probe, wouldn't that prevent the unbind/remove?
Depends on what you are increasing the reference count of :)
bind/unbind has nothing to do with the reference count of objects, it only prevents the devices/whatever from being removed from the system (hopefully.)
When doing "give me the driver ops of that driver over there" you have to be VERY careful. It's a very uncommon pattern that I can't think of anyone else doing outside of a bus logic. I don't recommend it, instead grab references to the devices themselves and go through the device and what is bound/unbound to that. Also keep things in sync when you grab those other devices, that can also get messy.
How about just not doing this at all?
That's what I was exploring, I am not sure at all we need to use this pointer at all.
Maybe we should go back to the problem statement:
The SoundWire codec driver provides a set of callbacks that need to be invoked from bus layer when events are detected (bus reconfiguration, clock about to be stopped, attach/detach and interrupt).
The devices are created upfront, and if there's no driver things will work just fine. When we bind a driver, then these callbacks are detected as not null.
Not null from where? And why is this different than any other bus?
The part that I am not familiar with is how we can prevent a remove/unbind while invoking the callbacks.
Who is calling these callbacks? From what path? bind/unbind happens through the driver core, into your bus code. So you have a lock for this, it should be serialized and should not be happening while some other bus callback is happening from the driver core.
If not, what specific path are you worried about here?
Maybe we can indeed increase the refcount on the device, follow the device->driver->ops pointers and reduce the refcount once the callback completes.
No, again, refcounts hold the device in place in memory, they have NOTHING to do with the driver binding to the device. Even if you have a refcount on a device, the driver can be unbound.
thanks,
greg k-h
participants (5)
-
Greg Kroah-Hartman
-
Mark Brown
-
Pierre-Louis Bossart
-
Srinivas Kandagatla
-
Takashi Iwai