I added debug messages to print the RIRBWP register and realized that a response can arrive between the read of RIRBWP in snd_hdac_bus_update_rirb() and the interrupt clear in hda_dsp_stream_interrupt(). That response is not handled, but its interrupt has already been cleared. This causes a timeout unless more responses arrive in the RIRB.
Now I noticed that the legacy driver already addressed this recently via commit 6d011d5057ff ("ALSA: hda: Clear RIRB status before reading WP").
We should have checked SOF at the same time, too...
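To illustrate the ordering problem described above, here is a minimal user-space sketch. It is not driver code: `struct fake_rirb`, `hw_deliver_late()` and the two handler functions are hypothetical names made up for this example, and the model assumes that a response arriving in the race window both advances the write pointer and sets the status bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the RIRB state; not the real driver structures. */
struct fake_rirb {
    unsigned int wp;     /* write pointer, advanced by the "hardware" */
    bool irq_status;     /* latched interrupt status bit */
    unsigned int late;   /* responses that arrive inside the race window */
};

/* Models the hardware delivering responses between two driver steps. */
void hw_deliver_late(struct fake_rirb *hw)
{
    assert(hw != NULL);
    if (hw->late) {
        hw->wp += hw->late;
        hw->irq_status = true;
        hw->late = 0;
    }
}

/* Racy order: read WP first, clear the status afterwards. */
unsigned int handle_read_then_clear(struct fake_rirb *hw, unsigned int *rp)
{
    unsigned int wp = hw->wp;          /* snapshot of the write pointer */
    unsigned int handled;

    hw_deliver_late(hw);               /* response lands after the snapshot */
    hw->irq_status = false;            /* ...and its status is wiped here */

    handled = wp - *rp;
    *rp = wp;
    return handled;
}

/* Fixed order: clear the status first, then read WP. */
unsigned int handle_clear_then_read(struct fake_rirb *hw, unsigned int *rp)
{
    unsigned int wp, handled;

    hw->irq_status = false;
    hw_deliver_late(hw);               /* late response re-asserts the status */
    wp = hw->wp;                       /* and is included in this read */

    handled = wp - *rp;
    *rp = wp;
    return handled;
}

/* True if a response sits in the ring with no interrupt left to signal it. */
bool response_stranded(const struct fake_rirb *hw, unsigned int rp)
{
    return hw->wp != rp && !hw->irq_status;
}
```

With one response already in the ring and one arriving in the window, the read-then-clear order handles one response and leaves the second stranded, while the clear-then-read order handles both. In the fixed order the status bit may stay set afterwards, which at worst costs an extra pass that finds nothing new.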
Thanks, Takashi. But the legacy driver doesn't remove the loop. The loop added in the SOF driver was based on the legacy driver and is there specifically to handle missed stream interrupts. Is there any harm in keeping the loop?
It might indeed be safer to keep the loop there. It's basically for a different kind of race, which can still happen theoretically.
Though, SOF uses a threaded interrupt, and it's interesting how the behavior differs. I can imagine that if the IRQ thread is running while a new IRQ is triggered, the hard IRQ handler won't queue it again. But I might be wrong here; this needs some checks.
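The case the loop is meant to cover can be sketched roughly as follows. This is a simplified user-space model, not the SOF or legacy handler: `struct fake_intsts`, `fake_read_and_ack()` and both service functions are hypothetical, and the model assumes the event that arrives during handling does not trigger a fresh interrupt, which is exactly the unverified scenario speculated about above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; not the real SOF or legacy HDA structures. */
struct fake_intsts {
    unsigned int pending;   /* status bits currently latched */
    unsigned int arriving;  /* bits that latch while the handler runs */
};

/* Read and acknowledge the latched status. Events that arrive during
 * handling only become visible on the next read. */
unsigned int fake_read_and_ack(struct fake_intsts *hw)
{
    unsigned int status;

    assert(hw != NULL);
    status = hw->pending;
    hw->pending = hw->arriving;
    hw->arriving = 0;
    return status;
}

/* Single pass: handle only what was latched on entry. */
unsigned int service_single_pass(struct fake_intsts *hw)
{
    return fake_read_and_ack(hw);
}

/* Bounded loop: re-read until nothing is pending or the pass budget
 * (an arbitrary illustrative limit) runs out. */
unsigned int service_with_loop(struct fake_intsts *hw, int max_passes)
{
    unsigned int handled = 0, status;

    while (max_passes-- > 0 && (status = fake_read_and_ack(hw)) != 0)
        handled |= status;
    return handled;
}
```

In this model a single pass leaves the late bit latched with nobody scheduled to look at it, while the loop picks it up on its second pass. Whether the real threaded handler can actually get into that state is the open question here.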
IIRC we added this loop before merging all interrupt handling into one thread. Somehow MSI mode never worked reliably without this change, so maybe we don't need this loop any longer.
I'd really prefer it if we didn't tie the RIRB handling change to this loop change. Removing the loop should only be done with *a lot of testing*.