[PATCH 1/6] Add ancillary bus support
Jason Gunthorpe
jgg at nvidia.com
Thu Oct 1 19:42:37 CEST 2020
On Thu, Oct 01, 2020 at 04:38:55PM +0200, Greg KH wrote:
> On Thu, Oct 01, 2020 at 11:33:34AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 01, 2020 at 02:14:23PM +0200, Greg KH wrote:
> > > On Thu, Oct 01, 2020 at 08:58:47AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Oct 01, 2020 at 01:05:51PM +0200, Greg KH wrote:
> > > >
> > > > > You have to be _VERY_ careful after calling
> > > > > ancillary_device_initialize(), as now you can not just free up the
> > > > > memory if something goes wrong before ancillary_device_add() is called,
> > > > > right?
> > > >
> > > > I've looked at way too many versions of this patch and related. This
> > > > is the only one so far that I didn't find various bugs on the error
> > > > cases.
> > >
> > > But you haven't seen the callers of this function. Without this
> > > documented, you will have problems.
> >
> > I've seen the Intel irdma, both versions of the SOF stuff and an
> > internal mlx5 patch..
> >
> > Look at the SOF example, it has perfectly paired error unwinds. Each
> > function has unwind that cleans up exactly what it creates. Every
> > 'free' unwind is paired with an 'alloc' in the same function. Simple.
> > Easy to audit. Easy to correctly enhance down the road.
> >
> > This is the common kernel goto error design pattern.
>
> But that's where people get this wrong.
People get everything wrong :( At least this pattern is easy to notice
and review.
> Once device_initialize() is called, the "free" can not be called,
> something else must be, device_put().
Yep!
However, with the one step device_register() pattern code usually
makes this class of mistake:
https://elixir.bootlin.com/linux/latest/source/drivers/firewire/core-device.c#L722
'goto skip_unit' does kfree() on something that already has been
device_initialized(). This is a real bug because this code called
dev_set_name() on line 713 and not doing the put_device() leaked the
name allocation. I think < v10 had this mistake.
dev_set_name() is a common error, here is another version:
https://elixir.bootlin.com/linux/latest/source/drivers/dma/idxd/cdev.c#L226
This correctly gets the switch to put_device() after
device_register(), but it calls kfree on line 220 after
dev_set_name(). This leaks memory too. Something like v16 of this
series had this bug as well.
BTW, want a patch to add a kref_read(dev->kref) == 0 warning to
dev_set_name() ? This seems pretty common, these were the first two
random choices from LXR I checked :\
> Sure, but without a real user that _NEEDS_ this two-step process, let's
> not include it. Why bake complexity into the system from the start that
> is never used?
It just needs to not have these common error unwind bugs :(
Jason
More information about the Alsa-devel
mailing list