On Thu, Oct 01, 2020 at 04:38:55PM +0200, Greg KH wrote:
On Thu, Oct 01, 2020 at 11:33:34AM -0300, Jason Gunthorpe wrote:
On Thu, Oct 01, 2020 at 02:14:23PM +0200, Greg KH wrote:
On Thu, Oct 01, 2020 at 08:58:47AM -0300, Jason Gunthorpe wrote:
On Thu, Oct 01, 2020 at 01:05:51PM +0200, Greg KH wrote:
You have to be _VERY_ careful after calling ancillary_device_initialize(), as now you can not just free up the memory if something goes wrong before ancillary_device_add() is called, right?
I've looked at way too many versions of this patch and related. This is the only one so far that I didn't find various bugs on the error cases.
But you haven't seen the callers of this function. Without this documented, you will have problems.
I've seen the Intel irdma, both versions of the SOF stuff and an internal mlx5 patch..
Look at the SOF example, it has perfectly paired error unwinds. Each function has unwind that cleans up exactly what it creates. Every 'free' unwind is paired with an 'alloc' in the same function. Simple. Easy to audit. Easy to correctly enhance down the road.
This is the common kernel goto error design pattern.
But that's where people get this wrong.
People get everything wrong :( At least this pattern is easy to notice and review.
Once device_initialize() is called, the "free" can not be called, something else must be, device_put().
Yep!
However, with the one step device_register() pattern code usually makes this class of mistake:
https://elixir.bootlin.com/linux/latest/source/drivers/firewire/core-device....
'goto skip_unit' does kfree() on something that already has been device_initialized(). This is a real bug because this code called dev_set_name() on line 713 and not doing the put_device() leaked the name allocation. I think < v10 had this mistake.
dev_set_name() is a common error, here is another version:
https://elixir.bootlin.com/linux/latest/source/drivers/dma/idxd/cdev.c#L226
This correctly gets the switch to put_device() after device_register(), but it calls kfree on line 220 after dev_set_name(). This leaks memory too. Something like v16 of this series had this bug as well.
BTW, want a patch to add a kref_read(dev->kref) == 0 warning to dev_set_name() ? This seems pretty common, these were the first two random choices from LXR I checked :\
Sure, but without a real user that _NEEDS_ this two-step process, let's not include it. Why bake complexity into the system from the start that is never used?
It just needs to not have these common error unwind bugs :(
Jason