Re: Crash in acpi_ns_validate_handle triggered by soundwire on Linux 5.10

4 Feb 2021


On Thu, Feb 4, 2021 at 13:11 Marcin Ślusarz marcin.slusarz@gmail.com wrote:
...
On Mon, Feb 1, 2021 at 13:16 Marcin Ślusarz marcin.slusarz@gmail.com wrote:
...
On Mon, Feb 1, 2021 at 12:43 Rafael J. Wysocki rafael@kernel.org wrote:
...
On Fri, Jan 29, 2021 at 9:03 PM Marcin Ślusarz marcin.slusarz@gmail.com wrote:
...
On Fri, Jan 29, 2021 at 19:59 Marcin Ślusarz marcin.slusarz@gmail.com wrote:
...
On Thu, Jan 28, 2021 at 15:32 Marcin Ślusarz marcin.slusarz@gmail.com wrote:
...
On Thu, Jan 28, 2021 at 13:39 Rafael J. Wysocki rafael@kernel.org wrote:
> The only explanation for that I can think about (and which does not
> involve supernatural intervention so to speak) is a stack corruption
> occurring between these two calls in sdw_intel_acpi_cb().  IOW,
> something scribbles on the handle in the meantime, but ATM I have no
> idea what that can be.
I tried KASAN, but it didn't find anything, and the kernel actually booted
successfully.
I investigated this and it looks like a compiler bug (or something nastier),
but I can't find exactly where the registers get corrupted: if I add printks,
the corruption seems to happen on the printk side, but if I don't add them,
the value seems to get corrupted earlier.
(...)
...
I'm using gcc 10.2.1 from Debian testing.
Someone on IRC, after hearing only that "gcc miscompiles the kernel",
suggested disabling CONFIG_STACKPROTECTOR_STRONG.
It did help, and that matches my observations, so it's quite likely
the culprit.
What do we do now?
Figure out why the stack protection kicks in, I suppose.
The target object is not on the stack, so if the pointer to it is
valid (we need to verify somehow that it is indeed), dereferencing it
shouldn't cause the stack protection to trigger.
Well, the problem is not that the stack protector detects something, but
that the feature itself corrupts some registers.
I retract this statement.
Originally I based it on this piece of code:
   0xffffffff815781f0 <+35>:    mov    %r12,%rdx
   0xffffffff815781f3 <+38>:    mov    $0xffffffff81eca4c0,%rsi
   0xffffffff815781fa <+45>:    mov    $0xffffffff82146d46,%rdi
   0xffffffff81578201 <+52>:    call   0xffffffff818909f1 <printk>
   0xffffffff81578206 <+57>:    cmpb   $0xf,0x8(%r12)
where the crash happens on the last line, and I could supposedly see the
message printed by printk with the correct value of %r12.
However, after attaching kgdb+kgdboe (it's so much pain...) to the kernel,
I discovered that something corrupts memory so badly that the format
string becomes "", which means that I don't actually see the output of printk.
Oh crap, I can't reproduce it anymore. I might have tried this before
I disabled KASLR, which would explain why I saw "" as the format
string (0xffffffff82146d46 would not have been its real address).