for_each_rtd_components(rtd, rtdcom, component) {
	pr_err("plb: %s processing component\n", __func__);
	if (!component)
		pr_err("plb: %s component is NULL\n", __func__);
}
Could you perhaps add traces of which components are being accessed at each stage? We might want to go through and just add something like that in the code anyway to help figure things out.
I tried adding more traces but couldn't narrow things down to a clear issue, and the traces themselves have a Heisenbug effect.
So I switched to higher-level code analysis: it turns out that the soc_dai_link_remove() routine is called both from the topology code and on card cleanup.
Patch 06/19 in this series essentially forces the pcm_runtimes to be freed in both cases, so they are possibly freed twice for topology-managed dailinks, or freed using information that has already been released.
I 'fixed' this by adding an additional parameter so that the pcm runtime free is not done from the topology path (as was the case before), and the kernel oops goes away. My tests have now been running for 45 minutes, whereas without the change I get a kernel oops in fewer than 10-20 cycles (still more than our CI apparently tracks, which is something to improve).
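To illustrate the idea only, here is a small userspace model of the workaround. It is not the actual patch: fake_runtime, fake_dai_link_remove() and teardown() are made-up names, and the "free" is just a counter so the double release can be observed safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pcm runtime; not a kernel structure. */
struct fake_runtime {
	int free_count;	/* how many times the free path ran */
};

/*
 * Models a remove routine shared by two callers. The extra
 * free_runtime parameter lets one caller skip the free.
 */
static void fake_dai_link_remove(struct fake_runtime *rtd, bool free_runtime)
{
	assert(rtd != NULL);
	if (free_runtime)
		rtd->free_count++;	/* stands in for the real free */
}

/* Runs both removal paths and reports how often the runtime was freed. */
static int teardown(bool topology_frees)
{
	struct fake_runtime rtd = { 0 };

	fake_dai_link_remove(&rtd, topology_frees);	/* topology removal */
	fake_dai_link_remove(&rtd, true);		/* card cleanup */
	return rtd.free_count;
}
```

In this model teardown(true) counts two frees, which corresponds to the suspected double free, while teardown(false) counts exactly one, done by the card cleanup.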
I pushed the code to GitHub to check whether the Intel CI reports any regressions; the run should complete shortly: https://github.com/thesofproject/linux/pull/1469
I am not sure the suggested fix is correct: I don't fully understand what the topology and card cleanups should each do, or how the work is split between them, if at all.