Hi,
[ and sorry, I had the bounces address in cc originally, routing reply to list as well ]
On Fri, 4 Nov 2022, Takashi Iwai wrote:
On Fri, 04 Nov 2022 15:56:50 +0100, Kai Vehmanen wrote:
I've been debugging a SOF DSP load fail on VT-D/IOMMU systems [1] and that brought me to these two commits from you: .. but in rare low-memory cases, the dma_alloc_noncontiguous() call will fail in core/memalloc.c and we go to the fallback path.
So we go to snd_dma_sg_fallback_alloc(), but here it seems something is missing. The pages are not allocated with DMA mapping API anymore, so IOMMU won't know about the memory and in our case the DSP load will fail to a IOMMU fault.
[...]
Hrm, that's a tough issue. Basically if dma_alloc_noncontiguous() fails, it's really out of memory -- at least under the situation where IOMMU is enabled. The fallback path was introduced for the case where there is no IOMMU and noncontiguous allocation fails from the beginning.
ack. This matches with the bug reports. These happen rarely and on systems with a lot of memory pressure. We have recently submitted a few features&fixes that will further reduce these allocs, so it probably is better outcome to have a plain out of memory error.
And, what makes things more complicated is that snd_dma_dev_alloc() cannot be used for SG pages when IOMMU is involved. We can't get the proper pages for mapping SG from the DMA address in that case; some systems may work with virt_to_page() but it's not guranteed to work.
Aa, ok, so we need to use dma_alloc_noncontiguous() to do this properly and if it returns ENOMEM, that's what it is then, right.
So, I believe there are two issues:
- The firmware allocation fails with dma_alloc_noncontiguous() with
IOMMU enabled
- The helper goes to the fallback path and occasionally it worked,
although the pages can't be used with IOMMU.
Basically when 1 happens, we should just return an error without going to fallback path. But it won't "fix" your case?
I think an explicit error would be best. The problem now is that the driver will think the allocation (and mapping to device) is fine and proceeds to program the hardware to use the address. This will then create an IOMMU fault down the line that is not so straighforward to recover from (worst case is that a full device level reset needs to be done). And given driver doesn't know it got a faulty mapping, it's hard to make the decision why the fault happened.
Br, Kai