On Tue, 9 Jun 2020, Christoph Hellwig wrote:
> > Working theory is that CONFIG_DMA_NONCOHERENT_MMAP getting set is causing the error_code in the page fault path. Debugging with Alex off-thread we found that dma_{alloc,free}_from_pool() are not getting called from the new code in dma_direct_{alloc,free}_pages() and he has not enabled mem_encrypt.
>
> While DMA_COHERENT_POOL absolutely should not select DMA_NONCOHERENT_MMAP (and you should send your patch either way), I don't think it is going to make a difference here, as DMA_NONCOHERENT_MMAP just means we allow mmaps even for non-coherent devices, and we do not support non-coherent devices on x86.
We haven't heard yet whether disabling DMA_NONCOHERENT_MMAP fixes Aaron's BUG(), and the patch included some other debugging hints that will be printed out in case it doesn't, but I'll share what we figured out:
In 5.7, his config didn't have DMA_DIRECT_REMAP or DMA_REMAP (it did have GENERIC_ALLOCATOR already). AMD_MEM_ENCRYPT is set.
In Linus HEAD, AMD_MEM_ENCRYPT now selects DMA_COHERENT_POOL so it sets the two aforementioned options.
We also figured out that dma_should_alloc_from_pool() is always false up until the BUG(). So what else changed? Only the selection of DMA_REMAP and DMA_NONCOHERENT_MMAP.
The comment in the Kconfig about setting "an uncached bit in the pagetables" led me to believe it may be related to the splat he's seeing (reserved bit violation). So I suggested dropping DMA_NONCOHERENT_MMAP from his Kconfig for testing purposes.
If this option should not implicitly be set for DMA_COHERENT_POOL, then I assume we need yet another Kconfig option since DMA_REMAP selected it before and DMA_COHERENT_POOL selects DMA_REMAP :)
So do we want a DMA_REMAP_BUT_NO_DMA_NONCOHERENT_MMAP? Decouple DMA_REMAP from DMA_NONCOHERENT_MMAP and select the latter wherever the former was set (but not DMA_COHERENT_POOL)? Something else?
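As a rough, untested sketch, the second option (decoupling) could look something like the following in Kconfig terms. It uses only the relationships described above. EXISTING_DMA_REMAP_USER is a made-up stand-in for each symbol that selected DMA_REMAP before DMA_COHERENT_POOL did, not a real option, and any other lines the real entries carry are omitted here.

```kconfig
# Sketch only, not a tested patch against any tree.

config DMA_NONCOHERENT_MMAP
	bool

config DMA_REMAP
	bool
	# "select DMA_NONCOHERENT_MMAP" would be dropped from here

config DMA_COHERENT_POOL
	bool
	select DMA_REMAP
	# deliberately no DMA_NONCOHERENT_MMAP

# Hypothetical placeholder: every pre-existing selector of DMA_REMAP
# would pick up the select that DMA_REMAP no longer provides.
config EXISTING_DMA_REMAP_USER
	bool
	select DMA_REMAP
	select DMA_NONCOHERENT_MMAP
```

With a layout like this, AMD_MEM_ENCRYPT selecting DMA_COHERENT_POOL would still pull in DMA_REMAP but would no longer enable DMA_NONCOHERENT_MMAP as a side effect.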