Fwd: question about alsa tracepoints and alsa debugging
Hi all,
I have a rk3288 board I am working with, it has two ADCs and one DAC. The ADCs are connected via an FPGA to the rk3288. The DAC is directly connected to the I2S port of the rk3288. The rk3288 generates the master clock. I am using buildroot and busybox and kernel 5.9.12 (stable).
I have a driver which receives the ADC data from the FPGA via SDIO. When I run speaker-test in the background and use arecord to record simultaneously in the foreground, both arecord and speaker-test seem to lock up. When I run speaker-test alone I do not have a problem. The playback device uses the rockchip_i2s.c driver while the capture device uses my SDIO driver.
My driver based on dw_mmc.c calls snd_pcm_period_elapsed() for each buffer of frames that is ready. I can see that this is being called for a while when I start arecord in the foreground, until the lockup occurs. The moment I start arecord, the console output from speaker-test stops, so it's locked up by starting arecord. Previously this was working fine on kernel 4.11. However, since I updated to kernel 5.9.12 this no longer works. Has anything changed between 4.11 and 5.9.12 that could be the cause of the problem?
commands used for starting playback / trace capture and for recording - ./perf record -F 99 -g -e snd_pcm:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* speaker-test -D hw:CARD=ak4458ak5558aud,DEV=0 -c8 -r192000 -FS32_LE -tsine & arecord -Dhw:CARD=ak4458ak5558aud,DEV=1 -fS32_LE -c16 -r192000 /mnt/ramdisk/rec.wav
I enabled the kernel options as described at https://www.kernel.org/doc/html/v5.9/sound/designs/tracepoints.html and have used perf to record a trace, which can be access here - https://kernel-debugging.s3-us-west-1.amazonaws.com/perf.data.script.2112202... (exported using "perf script")
The trace file shows "lockdep_recursion" at some point so I am wondering if my problem has to do with a deadlock somehow being caused. However, I also see handle_mm_fault, and do_page_fault, which has to do with memory mapping? So I'm not sure what the problem exactly is and what to do next to find out what's wrong.
Thanks in advance for your help and interest and patience, Bert
Hi all,
I have a rk3288 board I am working with, it has two ADCs and one DAC. The ADCs are connected via an FPGA to the rk3288. The DAC is directly connected to the I2S port of the rk3288. The rk3288 generates the master clock. I am using buildroot and busybox and kernel 5.9.12 (stable).
I have a driver which receives the ADC data from the FPGA via SDIO. When I run speaker-test in the background and use arecord to record simultaneously in the foreground, both arecord and speaker-test seem to lock up. When I run speaker-test alone I do not have a problem. The playback device uses the rockchip_i2s.c driver while the capture device uses my SDIO driver.
My driver based on dw_mmc.c calls snd_pcm_period_elapsed() for each buffer of frames that is ready. I can see that this is being called for a while when I start arecord in the foreground, until the lockup occurs. The moment I start arecord, the console output from speaker-test stops, so it's locked up by starting arecord. Previously this was working fine on kernel 4.11. However, since I updated to kernel 5.9.12 this no longer works. Has anything changed between 4.11 and 5.9.12 that could be the cause of the problem?
commands used for starting playback / trace capture and for recording - ./perf record -F 99 -g -e snd_pcm:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* speaker-test -D hw:CARD=ak4458ak5558aud,DEV=0 -c8 -r192000 -FS32_LE -tsine & arecord -Dhw:CARD=ak4458ak5558aud,DEV=1 -fS32_LE -c16 -r192000 /mnt/ramdisk/rec.wav
I enabled the kernel options as described at https://www.kernel.org/doc/html/v5.9/sound/designs/tracepoints.html and have used perf to record a trace, which can be access here - https://kernel-debugging.s3-us-west-1.amazonaws.com/perf.data.script.2112202... (exported using "perf script")
The trace file shows "lockdep_recursion" at some point so I am wondering if my problem has to do with a deadlock somehow being caused. However, I also see handle_mm_fault, and do_page_fault, which has to do with memory mapping? So I'm not sure what the problem exactly is and what to do next to find out what's wrong.
Thanks in advance for your help and interest and patience, Bert
On 12/23/20 5:31 AM, Bert Schiettecatte wrote:
Hi all,
I have a rk3288 board I am working with, it has two ADCs and one DAC. The ADCs are connected via an FPGA to the rk3288. The DAC is directly connected to the I2S port of the rk3288. The rk3288 generates the master clock. I am using buildroot and busybox and kernel 5.9.12 (stable).
I have a driver which receives the ADC data from the FPGA via SDIO. When I run speaker-test in the background and use arecord to record simultaneously in the foreground, both arecord and speaker-test seem to lock up. When I run speaker-test alone I do not have a problem. The playback device uses the rockchip_i2s.c driver while the capture device uses my SDIO driver.
My driver based on dw_mmc.c calls snd_pcm_period_elapsed() for each buffer of frames that is ready. I can see that this is being called for a while when I start arecord in the foreground, until the lockup occurs. The moment I start arecord, the console output from speaker-test stops, so it's locked up by starting arecord.
Are these implemented as two separate sound cards? If so I don't see how they could interfere with each other. There should not be any shared resources or locks between them at least on the ALSA side.
Previously this was working fine on kernel 4.11. However, since I updated to kernel 5.9.12 this no longer works. Has anything changed between 4.11 and 5.9.12 that could be the cause of the problem?
There are almost 300k commits between v4.11 and v5.9.12 one of them surely could have added a new issue :)
If possible try to narrow this down a bit more, this will help to track this down. If this is a more general issue it is quite likely that this is a more recently introduced problem, otherwise others would have already run into it. Maybe try one of the stable kernels like v5.4.
commands used for starting playback / trace capture and for recording - ./perf record -F 99 -g -e snd_pcm:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* speaker-test -D hw:CARD=ak4458ak5558aud,DEV=0 -c8 -r192000 -FS32_LE -tsine & arecord -Dhw:CARD=ak4458ak5558aud,DEV=1 -fS32_LE -c16 -r192000 /mnt/ramdisk/rec.wav
I enabled the kernel options as described at https://www.kernel.org/doc/html/v5.9/sound/designs/tracepoints.html and have used perf to record a trace, which can be access here - https://kernel-debugging.s3-us-west-1.amazonaws.com/perf.data.script.2112202... (exported using "perf script")
The trace file shows "lockdep_recursion" at some point so I am wondering if my problem has to do with a deadlock somehow being caused.
lockdep_recursion is not even a function. I think this is just an incorrectly resolved stack trace.
To check if there is a locking issue somewhere compile your kernel with CONFIG_PROVE_LOCKING=y
Do both devices I2S and MMC use the same DMA? If so I could see things going wrong there since that would be shared resources in the DMA driver. If possible try to run the MMC driver without DMA and see what happens.
- Lars
Hi Lars-Peter
occurs. The moment I start arecord, the console output from speaker-test stops, so it's locked up by starting arecord.
Are these implemented as two separate sound cards? If so I don't see how they could interfere with each other. There should not be any shared resources or locks between them at least on the ALSA side.
here are the definitions I've used in my driver - I had to translate my old code from kernel 4.11 to using the new macros so not sure if something got broken while doing that. I want my DAC and ADCs to be one sound card device.
SND_SOC_DAILINK_DEFS(ak4458, DAILINK_COMP_ARRAY(COMP_CPU("ff890000.i2s")), DAILINK_COMP_ARRAY(COMP_CODEC(NULL, "ak4458")), DAILINK_COMP_ARRAY(COMP_PLATFORM("ff890000.i2s")));
SND_SOC_DAILINK_DEFS(ak5558, DAILINK_COMP_ARRAY(COMP_CPU("audio-cpld")), DAILINK_COMP_ARRAY(COMP_CODEC(NULL, "ak5558")), DAILINK_COMP_ARRAY(COMP_PLATFORM("audio-cpld")));
enum { DAILINK_AK4458, DAILINK_AK5558, };
static struct snd_soc_dai_link rockchip_ak4458_ak5558_dai[] = { [DAILINK_AK4458] = { .name = "Audio out", .ops = &rockchip_ak4458_dai_ops, SND_SOC_DAILINK_REG(ak4458), }, [DAILINK_AK5558] = { .name = "Audio in", .ops = &rockchip_ak5558_dai_ops, SND_SOC_DAILINK_REG(ak5558), }, };
static struct snd_soc_card rockchip_ak4458_ak5558 = { .name = "ak4458-ak5558-audio", .dai_link = rockchip_ak4458_ak5558_dai, .num_links = ARRAY_SIZE(rockchip_ak4458_ak5558_dai), };
If possible try to narrow this down a bit more, this will help to track this down. If this is a more general issue it is quite likely that this is a more recently introduced problem, otherwise others would have already run into it. Maybe try one of the stable kernels like v5.4.
I will play around with a few different versions to see what happens.
To check if there is a locking issue somewhere compile your kernel with CONFIG_PROVE_LOCKING=y
do I need PROVE_RAW_LOCK_NESTING as well? what about PROVE_RCU and PROVE_RCU_LIST ?
Do both devices I2S and MMC use the same DMA? If so I could see things going wrong there since that would be shared resources in the DMA driver. If possible try to run the MMC driver without DMA and see what happens.
I think they both use DMA but I don't think the resources are shared, from what I can tell. My version of dw_mmc.c is 99% the same code but I just directly call my own callbacks in dw_mci_interrupt instead of dealing with the tasklet or work stuff.
Thanks for helping! Bert
Hi,
commands used for starting playback / trace capture and for recording - ./perf record -F 99 -g -e snd_pcm:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* speaker-test -D hw:CARD=ak4458ak5558aud,DEV=0 -c8 -r192000 -FS32_LE -tsine & arecord -Dhw:CARD=ak4458ak5558aud,DEV=1 -fS32_LE -c16 -r192000 /mnt/ramdisk/rec.wav
I enabled the kernel options as described at https://www.kernel.org/doc/html/v5.9/sound/designs/tracepoints.html and have used perf to record a trace, which can be access here - https://kernel-debugging.s3-us-west-1.amazonaws.com/perf.data.script.2112202... (exported using "perf script")
The trace file shows "lockdep_recursion" at some point so I am wondering if my problem has to do with a deadlock somehow being caused.
According to the log, speaker-test starts PCM substream for playback direction with below parameters:
- bits per sample: 32 bit - samples per frame: 8 - frames per second: 192000 - frames per period: 4096 - frames per buffer: 16384 - ...
By calculation with the above parameters, the hardware of I2S interface should be configured to generate harware interrupt every 21.3 msec.
I wrote a Python 3 script to check the period of interrupt with a focus on the record including call of 'snd_pcm_period_elapsed' in its backtrace: https://gist.github.com/takaswie/af2b93b58a1154903d6fbe1a6bee3c69
``` $ ./parse.py /tmp/perf.data.script.211220202054.txt | \ grep -B 3 -A 3 unexpected 73.2000 0.02132900 expected 73.2213 0.02133500 expected 73.2426 0.02135300 expected 110.0207 36.77807000 unexpected 110.0420 0.02133000 expected 110.0634 0.02133400 expected 110.0847 0.02132400 expected ```
I can find around 37 second gap in timestamp 110.0207. It means that the I2S driver failed to control RK3288 I2S interface for periodical hardware interrupt.
Furthermore, I can see following records after the gap. It means that speaker-test process successfully detected XRUN of the PCM substream, then recoverd it to continue processing PCM frames for playback expectedly. Thus the I2S driver seems to have few problems about dead-lock issue.
With records of tracepoints events for capture PCM substream which Bert's SDIO driver handles as well as for the playback PCM substream which speaker-test starts, we can get more details of runtimes.
P.S. diagrams for the case:
``` +-----------+ +----+ +------+ +-------+ |AK5558 x 2 |<-(?)->|FPGA| |RK3288| | AK4458| | (ADC) |<-(?)->| |<-(SDIO)->| |<-(I2S)->| (DAC) | +-----------+ +----+ +------+ +-------+ ```
Regards
Takashi Sakamoto
Hi Takashi-san
On Thu, Dec 24, 2020 at 5:00 PM Takashi Sakamoto o-takashi@sakamocchi.jp wrote:
With records of tracepoints events for capture PCM substream which Bert's SDIO driver handles as well as for the playback PCM substream which speaker-test starts, we can get more details of runtimes.
Thanks so much for your detailed explanation. I tried capturing a new trace for both speaker-test and arecord by running perf from a different console and keeping an eye on /var/log/messages and it's clear there is a deadlock, please see below. In my code I can see that spin_lock_irqsave() is being called in the dw_mci_interrupt() which then calls some other code which again calls spin_lock_irqsave() so I guess that is what the problem is, and I have to restructure my code to avoid that. What I don't understand, is why it says speaker-test is trying to acquire the lock, while it's arecord that is using the capture device to record (which uses my sdio driver). I guess it doesn't matter because playback and capture are a single device (?).
Thanks! Bert
[ 187.653313] [ 187.656519] ============================================ [ 187.663958] WARNING: possible recursive locking detected [ 187.671392] 5.9.12-00015-g3d36682d8cdf-dirty #74 Not tainted [ 187.679214] -------------------------------------------- [ 187.686613] speaker-test/193 is trying to acquire lock: [ 187.693886] ed337cbc (&host->irq_lock#2){-...}-{2:2}, at: dw_mci_set_drto+0xa0/0xf8 [ 187.703898] [ 187.703898] but task is already holding lock: [ 187.713189] ed337cbc (&host->irq_lock#2){-...}-{2:2}, at: dw_mci_interrupt+0x468/0x4e4 [ 187.723463] [ 187.723463] other info that might help us debug this: [ 187.733531] Possible unsafe locking scenario: [ 187.733531] [ 187.742727] CPU0 [ 187.746692] ---- [ 187.750624] lock(&host->irq_lock#2); [ 187.755998] lock(&host->irq_lock#2); [ 187.761326] [ 187.761326] *** DEADLOCK *** [ 187.761326] [ 187.771203] May be due to missing lock nesting notation [ 187.771203] [ 187.780964] 1 lock held by speaker-test/193: [ 187.786799] #0: ed337cbc (&host->irq_lock#2){-...}-{2:2}, at: dw_mci_interrupt+0x468/0x4e4 [ 187.797239] [ 187.797239] stack backtrace: [ 187.804121] CPU: 0 PID: 193 Comm: speaker-test Not tainted 5.9.12-00015-g3d36682d8cdf-dirty #74 [ 187.814905] Hardware name: Rockchip (Device Tree) [ 187.821204] [<c0311930>] (unwind_backtrace) from [<c030c200>] (show_stack+0x10/0x14) [ 187.830964] [<c030c200>] (show_stack) from [<c0f4b9b0>] (dump_stack+0xd0/0xf0) [ 187.840147] [<c0f4b9b0>] (dump_stack) from [<c03adc34>] (__lock_acquire+0xff0/0x3394) [ 187.850018] [<c03adc34>] (__lock_acquire) from [<c03b0c74>] (lock_acquire+0x134/0x56c) [ 187.859997] [<c03b0c74>] (lock_acquire) from [<c0f62900>] (_raw_spin_lock_irqsave+0x4c/0x94) [ 187.870578] [<c0f62900>] (_raw_spin_lock_irqsave) from [<c0d0fca8>] (dw_mci_set_drto+0xa0/0xf8) [ 187.881474] [<c0d0fca8>] (dw_mci_set_drto) from [<c0d10208>] (dw_mci_interrupt+0x484/0x4e4) [ 187.891976] [<c0d10208>] (dw_mci_interrupt) from [<c03bf90c>] (__handle_irq_event_percpu+0x70/0x484) [ 187.903363] [<c03bf90c>] (__handle_irq_event_percpu) from [<c03bfd50>] (handle_irq_event_percpu+0x30/0x8c) [ 187.915376] [<c03bfd50>] (handle_irq_event_percpu) from [<c03bfde4>] (handle_irq_event+0x38/0x5c) [ 187.926515] [<c03bfde4>] (handle_irq_event) from [<c03c46fc>] (handle_fasteoi_irq+0xcc/0x124) [ 187.937255] [<c03c46fc>] (handle_fasteoi_irq) from [<c03be990>] (generic_handle_irq+0x34/0x44) [ 187.948107] [<c03be990>] (generic_handle_irq) from [<c03befe4>] (__handle_domain_irq+0x7c/0xe8) [ 187.959073] [<c03befe4>] (__handle_domain_irq) from [<c07a1340>] (gic_handle_irq+0x58/0x9c) [ 187.969677] [<c07a1340>] (gic_handle_irq) from [<c0300f74>] (__irq_usr+0x54/0x80) [ 187.979345] Exception stack(0xec40ffb0 to 0xec40fff8) [ 187.986316] ffa0: bff0a126 b6ef7ef0 00000044 b6ef7670 [ 187.996843] ffc0: 00000000 00000000 b6bfea54 b6bfb008 00000003 b6bfb008 000252a0 00025140 [ 188.007419] ffe0: 000250b0 beb2484c 00012f44 b6ebc26c 60030010 ffffffff
Hi Takashi-san, Lars-Peter
clear there is a deadlock, please see below. In my code I can see that spin_lock_irqsave() is being called in the dw_mci_interrupt() which then calls some other code which again calls spin_lock_irqsave() so I guess that is what the problem is, and I have to restructure my code to avoid that.
I managed to solve my problem by removing the accidentally introduced extra locking call in dw_mci_set_drto(). So now everything seems to work smoothly.
Thanks again for your help! Bert
participants (3)
-
Bert Schiettecatte
-
Lars-Peter Clausen
-
Takashi Sakamoto