Behringer WING usb audio - cyclic xruns dependent on periods/buffers
I've got a Behringer WING digital mixer which is equipped with a USB interface supporting 48 in, 48 out at 44.1 or 48kHz. It's plugged into a USB3 interface and for the most part it seems to work well as a class-compliant audio interface, but I'm struggling to eliminate xruns and it's starting to feel either driver or hardware quirk related issue.
I've done all the usual things -- I'm running a PREEMPT_RT kernel, set up the realtime priorities of Jack and the relevant USB IRQ to no avail. I'm doing most of the debugging remotely over ssh with X shutdown and most other processes on the system stopped. There's nothing plugged into the computer beyond network, keyboard, mouse and the audio interface.
I'm currently testing on kernel 5.10.0-rc5 with preempt-rt patches, but I've previously seen the same things on 5.4-rt (Debian backports) and 5.8 (self-compiled).
The xruns are characterised by bursts of "retire_capture_urb" warnings in the kernel logs (and in particular the frame status for everything in those URBs is EXDEV, meaning I think that the frames haven't been consumed by the time it's retired). I've patched sound/driver/pcm.c to provide me with a bit more debugging information and it looks like usually it's almost always between 7-10 consecutive calls to retire_capture_urb that are affected.
The bursts of retire_capture_urb warnings seem roughly cyclic, with a cycle time that is dependent on the combination of the frames/period setting and the number of periods per buffer, though it doesn't appear to be a strictly linear thing: 512/2 (~170s cycle); 256/2 (~94s); 128/3 (~65s); 64/13 (~270s). I can't immediately see any definitive smoking gun in the urb or interrupt counts, but there is some grouping in the timings between the xruns. Taking 64 frames, 5 periods, it's usually 50s, but sometimes 60s:
21:47:57 50s since xrun 4 21:48:47 50s since xrun 5 21:49:46 59s since xrun 6 21:50:46 60s since xrun 7 21:51:36 50s since xrun 8 21:52:36 60s since xrun 9 21:53:36 60s since xrun 10 21:54:37 61s since xrun 11 21:55:27 50s since xrun 12 21:56:27 60s since xrun 13 21:57:27 60s since xrun 14 21:58:17 50s since xrun 15 21:59:16 59s since xrun 16 22:00:06 50s since xrun 17 22:00:56 50s since xrun 18 ...
If I count the URB numbers (sorry, I'm shaky on the terminology, but I'm counting each URB which is retired) I often (but not always) see the same number of URBs passing between xrun bursts. Currently at 64 frames/5 periods, it's often exactly 50042 or 49923 urbs between bursts.
The delay between starting Jack and encountering the first burst is not predictable -- it's not just a whole cycle -- but stopping jack for e.g 30s then starting again delays the next burst by the same amount. So it does seem related to something in the audio streaming rather than anything else going on on the system interfering. It's as if something is slowly slipping out of sync between the Wing and the kernel until they need to resync, but I don't really know enough about USB to have any deeper insight.
All the tests above happen to be at 44.1kHz with the Wing set to 48 in, 48 out, but changing to 48kHz or 2/2 IO doesn't seem to cure anything.
One more data point which may not be relevant: if I switch off or unplug the Wing while Jack is running, the system gradually locks up over a period of a few seconds. I can get something out of the console briefly afterwards but within about ten seconds it's completely unresponsive. I imagine that pulling out a USB device from a realtime thread isn't a kind thing to do, but I don't recall this happening with other interfaces.
Any help or insights into where I should be looking (I'm a newbie in kernel space) appreciated. If someone wants remote access to the box to investigate in realtime we could probably figure something out.
bjb
On Thu, 26 Nov 2020 19:06:04 +0100, Ben Bell wrote:
I've got a Behringer WING digital mixer which is equipped with a USB interface supporting 48 in, 48 out at 44.1 or 48kHz. It's plugged into a USB3 interface and for the most part it seems to work well as a class-compliant audio interface, but I'm struggling to eliminate xruns and it's starting to feel either driver or hardware quirk related issue.
I've done all the usual things -- I'm running a PREEMPT_RT kernel, set up the realtime priorities of Jack and the relevant USB IRQ to no avail. I'm doing most of the debugging remotely over ssh with X shutdown and most other processes on the system stopped. There's nothing plugged into the computer beyond network, keyboard, mouse and the audio interface.
I'm currently testing on kernel 5.10.0-rc5 with preempt-rt patches, but I've previously seen the same things on 5.4-rt (Debian backports) and 5.8 (self-compiled).
The xruns are characterised by bursts of "retire_capture_urb" warnings in the kernel logs (and in particular the frame status for everything in those URBs is EXDEV, meaning I think that the frames haven't been consumed by the time it's retired). I've patched sound/driver/pcm.c to provide me with a bit more debugging information and it looks like usually it's almost always between 7-10 consecutive calls to retire_capture_urb that are affected.
The bursts of retire_capture_urb warnings seem roughly cyclic, with a cycle time that is dependent on the combination of the frames/period setting and the number of periods per buffer, though it doesn't appear to be a strictly linear thing: 512/2 (~170s cycle); 256/2 (~94s); 128/3 (~65s); 64/13 (~270s). I can't immediately see any definitive smoking gun in the urb or interrupt counts, but there is some grouping in the timings between the xruns. Taking 64 frames, 5 periods, it's usually 50s, but sometimes 60s:
21:47:57 50s since xrun 4 21:48:47 50s since xrun 5 21:49:46 59s since xrun 6 21:50:46 60s since xrun 7 21:51:36 50s since xrun 8 21:52:36 60s since xrun 9 21:53:36 60s since xrun 10 21:54:37 61s since xrun 11 21:55:27 50s since xrun 12 21:56:27 60s since xrun 13 21:57:27 60s since xrun 14 21:58:17 50s since xrun 15 21:59:16 59s since xrun 16 22:00:06 50s since xrun 17 22:00:56 50s since xrun 18 ...
If I count the URB numbers (sorry, I'm shaky on the terminology, but I'm counting each URB which is retired) I often (but not always) see the same number of URBs passing between xrun bursts. Currently at 64 frames/5 periods, it's often exactly 50042 or 49923 urbs between bursts.
The delay between starting Jack and encountering the first burst is not predictable -- it's not just a whole cycle -- but stopping jack for e.g 30s then starting again delays the next burst by the same amount. So it does seem related to something in the audio streaming rather than anything else going on on the system interfering. It's as if something is slowly slipping out of sync between the Wing and the kernel until they need to resync, but I don't really know enough about USB to have any deeper insight.
All the tests above happen to be at 44.1kHz with the Wing set to 48 in, 48 out, but changing to 48kHz or 2/2 IO doesn't seem to cure anything.
One more data point which may not be relevant: if I switch off or unplug the Wing while Jack is running, the system gradually locks up over a period of a few seconds. I can get something out of the console briefly afterwards but within about ten seconds it's completely unresponsive. I imagine that pulling out a USB device from a realtime thread isn't a kind thing to do, but I don't recall this happening with other interfaces.
Any help or insights into where I should be looking (I'm a newbie in kernel space) appreciated. If someone wants remote access to the box to investigate in realtime we could probably figure something out.
In general you should avoid 44.1kHz if you want a small period size for a realtime process on USB-audio. With 44.1kHz, the packet size can't be fixed in integer, and the ISO transfer requires variable packet sizes. OTOH, ALSA API requires the fixed period size, hence it'll lead to inconsistencies occasionally.
Takashi
On Fri, Nov 27, 2020 at 10:20:59AM +0100, Takashi Iwai wrote:
On Thu, 26 Nov 2020 19:06:04 +0100, Ben Bell wrote:
I've got a Behringer WING digital mixer which is equipped with a USB interface supporting 48 in, 48 out at 44.1 or 48kHz. It's plugged into a USB3 interface and for the most part it seems to work well as a class-compliant audio interface, but I'm struggling to eliminate xruns and it's starting to feel either driver or hardware quirk related issue.
...
In general you should avoid 44.1kHz if you want a small period size for a realtime process on USB-audio. With 44.1kHz, the packet size can't be fixed in integer, and the ISO transfer requires variable packet sizes. OTOH, ALSA API requires the fixed period size, hence it'll lead to inconsistencies occasionally.
Thanks Takashi, that's useful to know. Most of my historical projects are at 44.1kHz which is why I'm using that here, but I'll try to switch to 48kHz in future work to avoid this. That said, I've tested a little with 48kHz and had similar problems. I'm currently encountering a period of relative stability (an xrun every 20 minutes or so) so I don't want to touch anything until I've got some recording out of the way, but I'll switch to 48kHz for future testing too.
I've read people saying there are problems with USB3 and audio which go away when they switch to USB 2, so I've got a card in the post so we'll see whether that helps.
In the meantime, I've enabled xhci_hcd dynamic debugging and the xruns all coincide with events like this. I'm wondering whether this is actually more of a USB host issue rather than an ALSA one?
1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000d9d39c02, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: // Ding dong! 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 000000001152b022, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000aad0b004, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000014bb8b3, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 0000000058d3a525, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000095e4d66, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000c5233f07, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 0000000058cfb2b9, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000ecf776e8, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 0000000090c79b8f, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 00000000ad358790, dev 4, ep 0x81, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 000000008f26c7ed, dev 4, ep 0x1, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: // Ding dong! 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cancel URB 0000000020b0a760, dev 4, ep 0x1, starting at offset 0xcfbNNNNN 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Stopped on Transfer TRB for slot 1 ep 2 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Removing canceled TD starting at 0xcfbNNNNN (dma). 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Finding endpoint context 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cycle state = 0x1 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: New dequeue segment = 000000009fc8e305 (virtual) 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: New dequeue pointer = 0xcfbNNNNN (DMA) 84 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Removing canceled TD starting at 0xcfbNNNNN (dma). 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Set TR Deq Ptr cmd, new deq seg = 000000009fc8e305 (0xcfbNNNNN dma), new deq ptr = 00000000cfc75f63 (0xcfb42440 dma), new cycle = 1 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: // Ding dong! 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Stopped on Transfer TRB for slot 1 ep 1 38 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Removing canceled TD starting at 0xcfbNNNNN (dma). 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Finding endpoint context 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Cycle state = 0x0 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: New dequeue segment = 0000000070effd6e (virtual) 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: New dequeue pointer = 0xcfbNNNNN (DMA) 29 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Removing canceled TD starting at 0xcfbNNNNN (dma). 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Set TR Deq Ptr cmd, new deq seg = 0000000070effd6e (0xcfbNNNNN dma), new deq ptr = 0000000034521cdf (0xcfb77180 dma), new cycle = 0 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: // Ding dong! 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Successful Set TR Deq Ptr cmd, deq = @cfb42440 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Successful Set TR Deq Ptr cmd, deq = @cfb77180 32 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Miss service interval error for slot 1 ep 2, set skip flag 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Found td. Clear skip flag for slot 1 ep 2. 1 Nov 28 09:12:32 rowlf kernel: [nnn] retire_capture_urb: frames 0 - 7 of [138046467] still active: -18 1 Nov 28 09:12:32 rowlf kernel: [nnn] retire_capture_urb: ...a total of 4 bad urbs; between [138046467] and [138046470]; kxrun = 238; 1120718 since last 41 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Miss service interval error for slot 1 ep 1, set skip flag 1 Nov 28 09:12:32 rowlf kernel: [nnn] xhci_hcd 0000:29:00.3: Found td. Clear skip flag for slot 1 ep 1.
(First column is a dedup count, and the retire_capture_urb messages are my own debug patch)
bjb
On Sat, Nov 28, 2020 at 09:36:00AM +0000, Ben Bell wrote:
In general you should avoid 44.1kHz if you want a small period size for a realtime process on USB-audio. With 44.1kHz, the packet size can't be fixed in integer, and the ISO transfer requires variable packet sizes. OTOH, ALSA API requires the fixed period size, hence it'll lead to inconsistencies occasionally.
So changing to 48kHz had no appreciable effect, but I'm working at that rate for now to eliminate the 44.1kHz weirdness from investigations. Using a USB2 card, also had no effect. Testing with a different USB audio interface (albeit a simple stereo one) didn't exhibit the behaviour, even when I took buffer sizes right down.
I'm not sure where to go to get more information or where's the best place to ask for help. I'm happy to do the leg work, but I don't know enough about the kernel, alsa or USB to figure it out without some help.
Current question: what is the delay in /proc/asound/card1/pcm0c/sub0/status actually measuring? I'm assuming it's measured in samples? I've written something to scrape the stats out in a tight loop and report.
What I see is a cycle where the delay rises and then a chunk equal to the frames per period (or sometimes, earlier on fpp-48) is removed. That feels like chunks being read out of a buffer. All fine.
After a while though, the maximum delay we reach with each cycle is creeping up. It increases by one every few cycles (usually two or three, or three or four -- always oscillating between two values) but of it's still only being emptied by a full period's worth of samples each cycle. So the overall effect is the delay creeps up and up until it hits the buffer size and then we get an xrun.
Like I said in the initial email, it feels like some sort of clock drift problem, where we're managing very slowly to collect more samples than we're reading -- to the tune of about 1 extra every few cycles -- and nothing on the consumer side is ever managing to compensate for that. I'm not even sure how that sort of drift would be possible though. Seems surprising.
Does any of this sound suspicious, or for that matter completely normal? Any suggestions where should I be looking next?
bjb
participants (2)
-
Ben Bell
-
Takashi Iwai