On Wed, 22 Apr 2020 22:50:28 +0200, Bjorn Helgaas wrote:
[+cc Rafael, linux-pm]
On Tue, Apr 21, 2020 at 03:08:44PM -0400, Alex Xu (Hello71) wrote:
With 5.7-rc2, after resuming from suspend to RAM, I get:
[ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0 [ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) [ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000 [ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First) [ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000 [ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback) [ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback) [ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed
I'm not at all confident in my decoding skills, but I *think* the TLP header decodes to:
Fmt 010b 3 DW header with data (32-bit address) Type 00000b MWr Length 0x4 4 DW = 16 bytes Requester ID 0x0a00 0a:00.0 Byte enables 0xff Address 0xfffc0e80
which would mean the 0a:00.0 GPU did a 16-byte write to 0xfffc0e80, and the 00:03.1 Root Port reported that as an Unsupported Request. I don't know why that would be unless the address is invalid.
Maybe that's supposed to be an MSI address? Maybe a complete dmesg or /proc/iomem would have a clue?
I feel like this UR issue could be a PCI core issue or maybe some sort of misuse of PCI power management, but I can't seem to get traction on it.
Then the display freezes and the system basically falls apart (can't even sudo reboot -f, need to use magic sysrq).
I bisected this to "ALSA: hda: Skip controller resume if not needed". Setting snd_hda_intel.power_save=0 resolves the issue.
FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip controller resume if not needed"), https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in v5.7-rc2.
Yes, and I posted the fix patch right now: https://lore.kernel.org/r/20200422203744.26299-1-tiwai@suse.de
The possible cause was the tricky resume code that both HD-audio controller (the parent PCI device) and the codec devices used.
At least the patch above seems working for the reporter's machine. Now we need a bit more testing before merging, but it looks promising, so far.
thanks,
Takashi