Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)
With 5.7-rc2, after resuming from suspend to RAM, I get:
[ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000
[ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First)
[ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000
[ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
[ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
[ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed
Then the display freezes and the system basically falls apart (can't even sudo reboot -f, need to use magic sysrq).
I bisected this to "ALSA: hda: Skip controller resume if not needed". Setting snd_hda_intel.power_save=0 resolves the issue.
I am using an ASRock B450 Pro4 with Realtek HDA codec:
[ 1.009400] snd_hda_intel 0000:0a:00.1: enabling device (0000 -> 0002)
[ 1.009425] snd_hda_intel 0000:0a:00.1: Force to non-snoop mode
[ 1.009653] snd_hda_intel 0000:0c:00.3: enabling device (0000 -> 0002)
[ 1.021452] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x7, too many assigned pins
[ 1.021461] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x9, too many assigned pins
[ 1.021471] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xb, too many assigned pins
[ 1.021480] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xd, too many assigned pins
[ 1.021482] snd_hda_codec_generic hdaudioC0D0: autoconfig for Generic: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
[ 1.021482] snd_hda_codec_generic hdaudioC0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 1.021483] snd_hda_codec_generic hdaudioC0D0: hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 1.021484] snd_hda_codec_generic hdaudioC0D0: mono: mono_out=0x0
[ 1.021484] snd_hda_codec_generic hdaudioC0D0: dig-out=0x3/0x5
[ 1.021485] snd_hda_codec_generic hdaudioC0D0: inputs:
[ 1.046053] snd_hda_codec_realtek hdaudioC1D0: autoconfig for ALC892: line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:line
[ 1.046054] snd_hda_codec_realtek hdaudioC1D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 1.046055] snd_hda_codec_realtek hdaudioC1D0: hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
[ 1.046055] snd_hda_codec_realtek hdaudioC1D0: mono: mono_out=0x0
[ 1.046056] snd_hda_codec_realtek hdaudioC1D0: inputs:
[ 1.046057] snd_hda_codec_realtek hdaudioC1D0: Front Mic=0x19
[ 1.046058] snd_hda_codec_realtek hdaudioC1D0: Rear Mic=0x18
[ 1.046058] snd_hda_codec_realtek hdaudioC1D0: Line=0x1a
I also have an ASUS RX 480 graphics card with HDMI audio output.
On Tue, 21 Apr 2020 21:08:44 +0200, Alex Xu (Hello71) wrote:
> I bisected this to "ALSA: hda: Skip controller resume if not needed". Setting snd_hda_intel.power_save=0 resolves the issue.
Hrm, it means the condition to skip the controller resume doesn't fit well. Does the patch below help?
But looking at the dmesg output:
> [ 1.021452] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x7, too many assigned pins
> [ 1.021461] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x9, too many assigned pins
> [ 1.021471] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xb, too many assigned pins
> [ 1.021480] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xd, too many assigned pins
> [ 1.021482] snd_hda_codec_generic hdaudioC0D0: autoconfig for Generic: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
> [ 1.021482] snd_hda_codec_generic hdaudioC0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.021483] snd_hda_codec_generic hdaudioC0D0: hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.021484] snd_hda_codec_generic hdaudioC0D0: mono: mono_out=0x0
> [ 1.021484] snd_hda_codec_generic hdaudioC0D0: dig-out=0x3/0x5
> [ 1.021485] snd_hda_codec_generic hdaudioC0D0: inputs:
... it looks like snd-hda-codec-generic is used for the HDMI/DP codec. This can't work well. Did you enable CONFIG_SND_HDA_HDMI?
In any case, please provide the alsa-info.sh output. Run the script with the --no-upload option and attach the output.
thanks,
Takashi
---
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1060,7 +1060,7 @@ static int azx_resume(struct device *dev)
 	/* check for the forced resume */
 	list_for_each_codec(codec, &chip->bus) {
-		if (hda_codec_need_resume(codec)) {
+		if (!codec->relaxed_resume) {
 			forced_resume = true;
 			break;
 		}
[+cc Rafael, linux-pm]
On Tue, Apr 21, 2020 at 03:08:44PM -0400, Alex Xu (Hello71) wrote:
> With 5.7-rc2, after resuming from suspend to RAM, I get:
> [ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
> [ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000
> [ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First)
> [ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000
> [ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
> [ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed
I'm not at all confident in my decoding skills, but I *think* the TLP header decodes to:
  Fmt            010b        3 DW header with data (32-bit address)
  Type           00000b      MWr
  Length         0x4         4 DW = 16 bytes
  Requester ID   0x0a00      0a:00.0
  Byte enables   0xff
  Address        0xfffc0e80
which would mean the 0a:00.0 GPU did a 16-byte write to 0xfffc0e80, and the 00:03.1 Root Port reported that as an Unsupported Request. I don't know why that would be unless the address is invalid.
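As a cross-check of the decode above, here is a minimal sketch in C that splits the four DWs from the "TLP Header:" line into fields. It assumes the 3 DW memory-request layout (32-bit address) and does not handle 4 DW headers or other TLP types. The names tlp_mem32, tlp_decode_mem32 and tlp_print are made up for illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of a 3 DW memory-request TLP header (32-bit address).
 * Hypothetical helper type, not taken from the kernel. */
struct tlp_mem32 {
	uint8_t  fmt;          /* DW0 bits 31:29 */
	uint8_t  type;         /* DW0 bits 28:24 */
	uint16_t length_dw;    /* DW0 bits 9:0, payload length in DWs */
	uint16_t requester_id; /* DW1 bits 31:16, bus:dev.fn */
	uint8_t  tag;          /* DW1 bits 15:8 */
	uint8_t  last_be;      /* DW1 bits 7:4 */
	uint8_t  first_be;     /* DW1 bits 3:0 */
	uint32_t address;      /* DW2 bits 31:2, low two bits are not address */
};

/* Split the four header DWs printed by AER into fields.
 * Only the 3 DW memory-request layout is assumed; hdr[3] is ignored. */
struct tlp_mem32 tlp_decode_mem32(const uint32_t hdr[4])
{
	struct tlp_mem32 t;

	assert(hdr != NULL);
	t.fmt = (hdr[0] >> 29) & 0x7;
	t.type = (hdr[0] >> 24) & 0x1f;
	t.length_dw = hdr[0] & 0x3ff;
	t.requester_id = (uint16_t)(hdr[1] >> 16);
	t.tag = (hdr[1] >> 8) & 0xff;
	t.last_be = (hdr[1] >> 4) & 0xf;
	t.first_be = hdr[1] & 0xf;
	t.address = hdr[2] & ~UINT32_C(0x3);
	return t;
}

/* Print the decoded fields, with the requester ID as bus:dev.fn. */
void tlp_print(const struct tlp_mem32 *t)
{
	printf("Fmt 0x%x Type 0x%x Length %u DW\n",
	       (unsigned)t->fmt, (unsigned)t->type, (unsigned)t->length_dw);
	printf("Requester %02x:%02x.%x Tag 0x%02x BE last/first 0x%x/0x%x\n",
	       (unsigned)(t->requester_id >> 8),
	       (unsigned)((t->requester_id >> 3) & 0x1f),
	       (unsigned)(t->requester_id & 0x7),
	       (unsigned)t->tag, (unsigned)t->last_be, (unsigned)t->first_be);
	printf("Address 0x%08lx\n", (unsigned long)t->address);
}
```

Feeding it the DWs 0x40000004, 0x0a0000ff, 0xfffc0e80, 0x00000000 from the log gives Fmt 010b, Type 00000b, Length 4 DW, Requester ID 0a:00.0 and Address 0xfffc0e80, which matches the table above.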
Maybe that's supposed to be an MSI address? Maybe a complete dmesg or /proc/iomem would have a clue?
I feel like this UR issue could be a PCI core issue or maybe some sort of misuse of PCI power management, but I can't seem to get traction on it.
> Then the display freezes and the system basically falls apart (can't even sudo reboot -f, need to use magic sysrq).
> I bisected this to "ALSA: hda: Skip controller resume if not needed". Setting snd_hda_intel.power_save=0 resolves the issue.
FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip controller resume if not needed"), https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in v5.7-rc2.
On Wed, 22 Apr 2020 22:50:28 +0200, Bjorn Helgaas wrote:
> FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip controller resume if not needed"), https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in v5.7-rc2.
Yes, and I posted the fix patch right now: https://lore.kernel.org/r/20200422203744.26299-1-tiwai@suse.de
The possible cause was the tricky resume code that both HD-audio controller (the parent PCI device) and the codec devices used.
At least the patch above seems working for the reporter's machine. Now we need a bit more testing before merging, but it looks promising, so far.
thanks,
Takashi
On Wed, Apr 22, 2020 at 11:25:04PM +0200, Takashi Iwai wrote:
> Yes, and I posted the fix patch right now: https://lore.kernel.org/r/20200422203744.26299-1-tiwai@suse.de
> The possible cause was the tricky resume code that both HD-audio controller (the parent PCI device) and the codec devices used.
> At least the patch above seems working for the reporter's machine. Now we need a bit more testing before merging, but it looks promising, so far.
Great, I'm glad you figured something out because I sure wasn't getting anywhere!
Maybe this is a tangent, but I can't figure out what snd_power_change_state() is doing. It *looks* like it's supposed to change the PCI power state, but I gave up trying to figure out where it actually touches the device.
It seems like sound has more magic in power management than other device types, which makes me wonder if we're not providing the right interfaces or something.
Bjorn
On Thu, 23 Apr 2020 01:21:27 +0200, Bjorn Helgaas wrote:
> Maybe this is a tangent, but I can't figure out what snd_power_change_state() is doing. It *looks* like it's supposed to change the PCI power state, but I gave up trying to figure out where it actually touches the device.
Not really, it merely updates the internal state field stored in the sound card object, see in include/sound/core.h:
static inline void snd_power_change_state(struct snd_card *card, unsigned int state)
{
	card->power_state = state;
	wake_up(&card->power_sleep);
}
The sound API explicitly blocks operations during suspend/resume by means of this card-level state and wait queue.
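To make the "state field plus wake-up" pattern concrete, here is a userspace analogy in C using a pthread mutex and condition variable. It is only a sketch of the idea, not the kernel implementation: model_card, model_card_init, model_power_change_state and model_power_wait_d0 are hypothetical names, and the two constants are assumed to match SNDRV_CTL_POWER_D0 and SNDRV_CTL_POWER_D3hot from the ALSA uapi header.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Assumed to mirror SNDRV_CTL_POWER_D0 / SNDRV_CTL_POWER_D3hot. */
enum { POWER_D0 = 0x0000, POWER_D3HOT = 0x0300 };

/* Userspace stand-in for the power fields of struct snd_card. */
struct model_card {
	pthread_mutex_t lock;
	pthread_cond_t  power_sleep;  /* plays the role of the wait queue */
	unsigned int    power_state;
};

void model_card_init(struct model_card *card)
{
	assert(card != NULL);
	pthread_mutex_init(&card->lock, NULL);
	pthread_cond_init(&card->power_sleep, NULL);
	card->power_state = POWER_D0;
}

/* Analogue of snd_power_change_state(): record the new state and wake
 * every waiter. No hardware is touched here. */
void model_power_change_state(struct model_card *card, unsigned int state)
{
	pthread_mutex_lock(&card->lock);
	card->power_state = state;
	pthread_cond_broadcast(&card->power_sleep);
	pthread_mutex_unlock(&card->lock);
}

/* What a blocked API call would do: sleep until the card is back in D0. */
void model_power_wait_d0(struct model_card *card)
{
	pthread_mutex_lock(&card->lock);
	while (card->power_state != POWER_D0)
		pthread_cond_wait(&card->power_sleep, &card->lock);
	pthread_mutex_unlock(&card->lock);
}
```

In this model a suspend path would call model_power_change_state(card, POWER_D3HOT), any caller of model_power_wait_d0() would then sleep, and the resume path would release them by switching the state back to POWER_D0. The actual PCI power transitions happen elsewhere, which is the point made above.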
thanks,
Takashi