[PATCH 1/6] Add ancillary bus support

Jason Gunthorpe jgg at nvidia.com
Thu Oct 1 19:42:37 CEST 2020


On Thu, Oct 01, 2020 at 04:38:55PM +0200, Greg KH wrote:
> On Thu, Oct 01, 2020 at 11:33:34AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 01, 2020 at 02:14:23PM +0200, Greg KH wrote:
> > > On Thu, Oct 01, 2020 at 08:58:47AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Oct 01, 2020 at 01:05:51PM +0200, Greg KH wrote:
> > > >  
> > > > > You have to be _VERY_ careful after calling
> > > > > ancillary_device_initialize(), as now you can not just free up the
> > > > > memory if something goes wrong before ancillary_device_add() is called,
> > > > > right?
> > > > 
> > > > I've looked at way too many versions of this patch and related. This
> > > > is the only one so far that I didn't find various bugs on the error
> > > > cases.
> > > 
> > > But you haven't seen the callers of this function.  Without this
> > > documented, you will have problems.
> > 
> > I've seen the Intel irdma, both versions of the SOF stuff and an
> > internal mlx5 patch..
> > 
> > Look at the SOF example, it has perfectly paired error unwinds. Each
> > function has unwind that cleans up exactly what it creates. Every
> > 'free' unwind is paired with an 'alloc' in the same function. Simple.
> > Easy to audit. Easy to correctly enhance down the road. 
> > 
> > This is the common kernel goto error design pattern.
> 
> But that's where people get this wrong. 

People get everything wrong :( At least this pattern is easy to notice
and review.

> Once device_initialize() is called, the "free" can not be called,
> something else must be, device_put().

Yep! 

However, with the one step device_register() pattern code usually
makes this class of mistake:

https://elixir.bootlin.com/linux/latest/source/drivers/firewire/core-device.c#L722

'goto skip_unit' does kfree() on something that already has been
device_initialized(). This is a real bug because this code called
dev_set_name() on line 713 and not doing the put_device() leaked the
name allocation. I think < v10 had this mistake.

dev_set_name() is a common error, here is another version:

https://elixir.bootlin.com/linux/latest/source/drivers/dma/idxd/cdev.c#L226

This correctly gets the switch to put_device() after
device_register(), but it calls kfree on line 220 after
dev_set_name(). This leaks memory too. Something like v16 of this
series had this bug as well.

BTW, want a patch to add a kref_read(dev->kref) == 0 warning to
dev_set_name() ? This seems pretty common, these were the first two
random choices from LXR I checked :\

> Sure, but without a real user that _NEEDS_ this two-step process, let's
> not include it.  Why bake complexity into the system from the start that
> is never used?

It just needs to not have these common error unwind bugs :(

Jason


More information about the Alsa-devel mailing list