RE: pcm|dmaengine|imx-sdma race condition on i.MX6

17 Aug 2020


      On 2020/08/17 15:29 Benjamin Bara - SKIDATA Benjamin.Bara@skidata.com wrote:
...
We think this is not an i.MX6-specific problem, but a problem of the
DMAengine usage from the PCM.
In case of a XRUN, the DMA channel is never closed but first a
SNDRV_PCM_TRIGGER_STOP next a SNDRV_PCM_TRIGGER_START is triggered.
The SNDRV_PCM_TRIGGER_STOP simply executes a
dmaengine_terminate_async() [1] but does not await the termination by calling
dmaengine_synchronize(), which is required as stated by the docu [2].
Anyways, we are not able to fix it in the pcm_dmaengine layer either at the
end of SNDRV_PCM_TRIGGER_STOP (called from the DMA on complete
interrupt handler) or at the beginning of SNDRV_PCM_TRIGGER_START (called
from a PCM ioctl), since the dmaengine_synchronize() requires a non-atomic
context.
Based on my understanding, most of the DMA implementations don't even
implement device_synchronize and if they do, it might not really be necessary
since the terminate_all operation is synchron.
With the i.MX6, it looks a bit different:
Since [4], the terminate_all operation really schedules a worker which waits
the required ~1ms and then does the context freeing.
Now, the ioctl(SNDRV_PCM_IOCTL_PREPARE) and the following
ioctl(SNDRV_PCM_IOCTL_READI_FRAMES),
which are called from US to handle/recover from a XRUN, are in a race with the
terminate_worker.
If the terminate_worker finishes earlier, everything is fine.
Otherwise, the sdma_prep_dma_cyclic() is called, sets up everything and as
soon as it is scheduled out to wait for data, the terminate_worker is scheduled
and kills it.
In this case, we wait in [5] until the timeout is reached and return with -EIO.
Based on our understanding, there exist two different fixing approaches:
We thought that the pcm_dmaengine should handle this by either
synchronizing the DMA on a trigger or terminating it synchronously.
However, as we are in an atomic context, we either have to give up the atomic
context of the PCM to finish the termination or we have to design a
synchronous terminate variant which is callable from an atomic context.
For the first option, which is potentially more performant, we have to leave the
atomic PCM context and we are not sure if we are allowed to.
For the second option, we would have to divide the dma_device terminate_all
into an atomic sync and an async one, which would align with the dmaengine
API, giving it the option to ensure termination in an atomic context.
Based on my understanding, most of them are synchronous anyways, for the
currently async ones we would have to implement busy waits.
However, with this approach, we reach the WARN_ON [6] inside of an atomic
context, indicating we might not do the right thing.
Ad Failure Log (at bottom):
I haven't added the ioctl syscalls, but this is basically the output with additional
prints to be able to follow the code execution path.
A XRUN (buffer size is 480 but 960 available) leads to a
SNDRV_PCM_TRIGGER_STOP.
This leads to terminate_async, starting the terminate_worker.
Next, the XRUN recovery triggers SNDRV_PCM_TRIGGER_START, calling
sdma_prep_dma_cyclic and then waits for the DMA in wait_for_avail().
Next we see the two freeings, first the old, then the newly added one; so the
terminate_worker is back at work.
Now the DMA is terminated, while we are still waiting on data from it.
What do you think about it? Is any of the provided solutions practicable?
If you need further information or additional logging, feel free to ask.
busy_wait is not good I think, would you please have a try with the attached patch
which is based on https://lkml.org/lkml/2020/8/11/111? The basic idea is
to keep the freed descriptor into another list for freeing in later terminate_worker
instead of freeing directly all in terminate_worker by vchan_get_all_descriptors
which may break next descriptor coming soon

RE: pcm|dmaengine|imx-sdma race condition on i.MX6

Robin Gong