[alsa-devel] race condition between dmaengine_pcm_dma_complete and snd_pcm_release

Mon Oct 12 14:01:10 CEST 2015

On 10/08/2015 03:46 PM, Matt Campbell wrote:
> Hi Lars,
> 
> Note: re-sending this message because I accidentally wasn't plain text
> email so it got bounced from the email list (thanks GMail). Sorry for
> the double mail to those that are CC'd.
> 
> Thanks for the info, this felt like a limitation to the DMA engine
> before I found this older thread so it's good to have that feeling
> validated.
> 
> Just for posterity I'll describe the situation I'm in with the RT
> patch. I'm on an i.MX6 solo processor which uses the imx-sdma driver.
> As far as I can tell, terminate_all properly shuts off the DMA (simply
> disabling the bit in the controller). The issue I'm having is the
> described in the following steps:
> 
> 1) An SDMA interrupt comes in, is handled by the top half and the
> bottom half tasklet is scheduled.
> 
> 2) Control returns to my RT process which is higher priority than the
> tasklet thread. My process decides it is time to shutdown and calls
> snd_pcm_close which will terminate the DMA and deallocates everything
> (including the substream which is the argument data to the tasklet)
> 
> 3) With my process out of the way the dma tasklet can run. It
> eventually calls dmaengine_pcm_dma_complete (or in my case
> imx_pcm_dma_complete). You'll run into problems when dereferencing the
> now invalid substream pointer.
> 
> I have one other note about your proposed fix. In
> dmaengine_pcm_dma_complete you can end up with xrun handling which
> will result in a reset of the stream including calling terminate_all
> on the DMA. I'm worried that even with an API call to the DMA system
> to synchronize completion of DMA tasklets, you could end up with a
> deadlock when this is called from the tasklet itself when resetting
> the stream. I think we need to put the tasklet synchronization just
> before snd_pcm_release is called (as you suggested in the older
> thread). This still makes me uneasy because deadlock is still possible
> if the tasklet takes locks that are held when snd_pcm_release is
> called. Thanks for the input!

Yes, very good analysis. This is the reason why we can't synchronize in
terminate_all() itself and have to introduce a separate synchronization
primitive to the API. Most users will be fine with terminating and syncing
at the same time, hence it makes sense to introduce a new API
terminate_all_sync() to cover those common cases. For cases where it is not
possible, e.g. because they either terminate the DMA from atomic context or
they can end up calling terminate_all() from the completion callback, a new
separate API function (e.g. dmaengine_synchronize()) will be introduced.
Users which use the asynchronous terminate API need to call
dmaengine_synchronize() before they free any resources that are accessed in
either the completion callback or the memory that is accessed by the DMA itself.

> This still makes me uneasy because deadlock is still possible
> if the tasklet takes locks that are held when snd_pcm_release is
> called. Thanks for the input

Good point. If that can happen we properly need to reference count the
struct that is passed to the complete callback.

> 
> On Thu, Oct 8, 2015 at 9:38 AM, Matt Campbell <mcampbell at izotope.com> wrote:
>> Hi Lars,
>>
>> Thanks for the info, this felt like a limitation to the DMA engine before I
>> found this older thread so it's good to have that feeling validated.
>>
>> Just for posterity I'll describe the situation I'm in with the RT patch. I'm
>> on an i.MX6 solo processor which uses the imx-sdma driver. As far as I can
>> tell, terminate_all properly shuts off the DMA (simply disabling the bit in
>> the controller). The issue I'm having is the described in the following
>> steps:
>>
>> 1) An SDMA interrupt comes in, is handled by the top half and the bottom
>> half tasklet is scheduled.
>>
>> 2) Control returns to my RT process which is higher priority than the
>> tasklet thread. My process decides it is time to shutdown and calls
>> snd_pcm_close which will terminate the DMA and deallocates everything
>> (including the substream which is the argument data to the tasklet)
>>
>> 3) With my process out of the way the dma tasklet can run. It eventually
>> calls dmaengine_pcm_dma_complete (or in my case imx_pcm_dma_complete).
>> You'll run into problems when dereferencing the now invalid substream
>> pointer.
>>
>> I have one other note about your proposed fix. In dmaengine_pcm_dma_complete
>> you can end up with xrun handling which will result in a reset of the stream
>> including calling terminate_all on the DMA. I'm worried that even with an
>> API call to the DMA system to synchronize completion of DMA tasklets, you
>> could end up with a deadlock when this is called from the tasklet itself
>> when resetting the stream. I think we need to put the tasklet
>> synchronization just before snd_pcm_release is called (as you suggested in
>> the older thread). This still makes me uneasy because deadlock is still
>> possible if the tasklet takes locks that are held when snd_pcm_release is
>> called. Thanks for the input!
>>
>> ~Matt
>>
>> On Thu, Oct 8, 2015 at 5:02 AM, Lars-Peter Clausen <lars at metafoo.de> wrote:
>>>
>>> On 10/06/2015 11:45 PM, Matt Campbell wrote:
>>>> Hi All,
>>>>
>>>> I've been trying to track down a kernel crash I've been getting when
>>>> closing ALSA devices. There is a race condition between the bottom half
>>>> interrupt handler for the DMA system (dmaengine_pcm_dma_complete in
>>>> pcm_dmaengine.c) and the releasing of the sub-stream resources it
>>>> receives
>>>> as an argument (when snd_pcm_release is called). An older thread from
>>>> 2013
>>>> discussed this to a good extent so I wont go into the details here. I've
>>>> been unable to track down either a patch to fix this or even a good
>>>> solution that I could implement. I've spent a couple days trying to
>>>> think
>>>> of an elegant solution and come up with nothing so far. Any input would
>>>> be
>>>> appreciated.
>>>>
>>>> Link to original thread:
>>>>
>>>> http://mailman.alsa-project.org/pipermail/alsa-devel/2013-October/067111.html
>>>>
>>>> It's worth nothing that the original thread points out how this can
>>>> arise
>>>> on multi-core systems. In my case I'm on a single core system, but with
>>>> the
>>>> real-time patch enabled and the userspace ALSA thread running at a
>>>> higher
>>>> priority than the kernel's tasklet thread which can reproduce this
>>>> almost
>>>> 100% of the time.
>>>
>>> Hi,
>>>
>>> The answer is still the same. This is an inherent issue with the DMAengine
>>> API that needs to be fixed by adding a synchronization primitive that
>>> allows
>>> consumers to make sure that all completion tasklets have stopped running.
>>> Unfortunately this wasn't implemented yet, but I need this in other places
>>> outside of ASoC and it's on my TODO list and I'll hopefully get to it in
>>> the
>>> next 2 months.
>>>
>>> But in your case on a single CPU system it could also be an issue with the
>>> DMA controller driver itself not properly stopping the transfer when
>>> dma_terminate_all() is called.
>>>
>>> - Lars
>>>
>>
>>
>>
>> --
>> Matthew Campbell
>> Senior Embedded Systems Engineer
>> mcampbell at izotope.com
>>
>> iZotope, Inc.
>> www.izotope.com
> 
> 
>