[PATCH] ASoC: dpcm: fix race condition to dpcm links in dpcm_be_dai_trigger

Wed Sep 29 23:01:12 CEST 2021

>> If routing change and underrun stop is run at the same time,
>> data abort can be occurred by the following sequence.
>>
>> CPU0: Processing underrun 	CPU1: Processing routing change
>> dpcm_be_dai_trigger():		dpcm_be_disconnect():
>>
>> for_each_dpcm_be(fe, stream, dpcm) {
>>
>> 				spin_lock_irqsave(&fe->card->dpcm_lock, flags);
>> 				list_del(&dpcm->list_be);
>> 				list_del(&dpcm->list_fe);
>> 				spin_unlock_irqrestore(&fe->card->dpcm_lock, flags);
>> 				kfree(dpcm);
>>
>> struct snd_soc_pcm_runtime *be = dpcm->be; <-- Accessing freed memory
>>
>> To prevent this situation, dpcm_lock is needed during accessing
>> the lists for dpcm links.
> 
> Isn't there still a possible inconsistency here introduced by the
> duplication of the BE list?
> 
> You protect the list creation, but before you use it in
> dpcm_be_dai_trigger(), there's a time window where the function could be
> pre-empted and a disconnect event might have happened. As a result you
> could trigger a BE that's no longer connected.
> 
> What you identified as a race is likely valid, but how to fix it isn't
> clear to me - the DPCM code isn't self-explanatory at all with its use
> in various places of the dpcm_lock spinlock, the pcm mutex, the card mutex.
> 
> Ideally we would need to find a way to prevent changes in connections
> while we are doing the triggers, but triggers can take a bit of time if
> they involve any sort of communication over a bus. I really wonder if
> this dpcm_lock should be a mutex and if the model for DPCM really
> involves interrupt contexts as the irqsave/irqrestore mentions hint at.

To follow-up on this, I started experimenting with a replacement of the
'dpcm_lock' spinlock with a 'dpcm_mutex', see
https://github.com/thesofproject/linux/pull/3186

If we combine both of our results, the 'right' solution might be to take
this mutex before every use of for_each_dpcm_be(), and unlock it at the
end of the loop, which additional changes to avoid re-taking the same
mutex in helper functions.

there's still a part in DPCM that I can't figure out, there is an
elaborate trick with an explicit comment

	/* if FE's runtime_update is already set, we're in race;
	 * process this trigger later at exit
	 */

Which looks like a missing mutex somewhere, or an overkill solution that
might never be needed.