[alsa-devel] UHCI dropped audio samples / DMA starvation fix
Hello Clemens and Takashi (and list),
I've been living with USB audio dropping partial frames on UHCI controllers for quite a few years. I've finally found the cause (DMA starvation) and have a lightweight fix.
Problem summary:
Playback (but not capture) on UHCI controllers randomly loses samples; they are skipped completely during playback. This happens anywhere from a few times a second to once an hour or so. These are not ALSA underruns; the UHCI host controller itself is reporting hard DMA starvation errors. ALSA ignores these errors and continues as if nothing happened, and the Linux UHCI driver does not log them anywhere.
UHCI overlaps sending on the USB bus with DMAing the data from memory. It does not have all the data it needs to transmit a complete packet before it begins doing so. If the data it needs is in a dirty cache line on a CPU, the flush is triggered only after DMA has already started; the UHCI host controller has to wait for the flush before accessing the data, and if the flush happens too late, UHCI aborts the packet. This is happening regularly on all the UHCI controllers I've tried thus far.
My guess is that this issue never attracted much attention because a) everyone knows isoch transfers are 'unreliable' and b) other transfer types automatically retry so they eventually succeed without any obvious sign that something went wrong (and they're usually smaller than one cacheline and so unaffected anyway).
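To put rough numbers on "too late", here is a back-of-the-envelope sketch. It assumes only that UHCI drives full-speed USB at 12 Mbit/s and ignores bit stuffing; the buffered byte counts and stall times are made-up illustration values, since the thread gives no figure for the controller's FIFO depth. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Full-speed USB signals at 12 Mbit/s, i.e. 12 bits per microsecond.
 * Bit stuffing is ignored, so real drain times are slightly longer. */
#define FS_BITS_PER_US 12u

/* Nanoseconds the transmitter can keep sending from `buffered` bytes
 * already sitting in the controller FIFO (hypothetical helper). */
static uint64_t drain_time_ns(uint32_t buffered)
{
    return (uint64_t)buffered * 8u * 1000u / FS_BITS_PER_US;
}

/* Returns 1 if a memory read stalled for `stall_ns` outlasts the data
 * already buffered, which is the starvation case described above. */
static int fifo_underruns(uint32_t buffered, uint64_t stall_ns)
{
    return stall_ns > drain_time_ns(buffered);
}
```

With only 8 bytes buffered, for example, a stall of more than roughly 5.3 microseconds would already starve the transmitter in this model, whereas a full 64-byte cache line on the wire buys about 42 microseconds.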
Fix:
The fix for usbaudio is dead easy. Call:
clflush_cache_range(urb->transfer_buffer, urb->transfer_buffer_length);
as the last step of prepare_playback_urb() and prepare_nodata_playback_urb(). This completely corrects the DMA starvation behavior on my machines here.
Note that the clflush_cache_range() has nothing to do with maintaining coherency. The call is only to force the flush ahead of time to avoid the fatal latency spikes from on-demand flushes when DMA is already running.
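For illustration, the placement of the call can be sketched as a userspace mock. This is not the actual sound/usb code: struct mock_urb, mock_prepare_playback_urb() and mock_clflush_cache_range() are hypothetical stand-ins, the 64-byte cache line size is an assumption, and the real cache-line flush is only issued when compiled for x86 with SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

/* Hypothetical stand-in for the two struct urb fields used here. */
struct mock_urb {
    void  *transfer_buffer;
    size_t transfer_buffer_length;
};

static size_t lines_flushed;  /* instrumentation for this sketch only */

/* Stand-in for the x86 kernel helper clflush_cache_range(): walks the
 * range one cache line at a time and writes each line back to memory. */
static void mock_clflush_cache_range(void *vaddr, size_t size)
{
    uintptr_t p = (uintptr_t)vaddr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)vaddr + size;

    if (size == 0)
        return;
    for (; p < end; p += CACHE_LINE) {
#if defined(__SSE2__) && (defined(__x86_64__) || defined(__i386__))
        __builtin_ia32_clflush((const void *)p);
#endif
        lines_flushed++;
    }
}

/* Mock of a playback prepare step: fill the buffer, then flush it as
 * the very last step so no dirty lines remain when DMA starts. */
static void mock_prepare_playback_urb(struct mock_urb *urb, size_t capacity,
                                      const int16_t *samples, size_t nsamples)
{
    size_t bytes = nsamples * sizeof(*samples);

    assert(bytes <= capacity);
    memcpy(urb->transfer_buffer, samples, bytes);
    urb->transfer_buffer_length = bytes;

    mock_clflush_cache_range(urb->transfer_buffer,
                             urb->transfer_buffer_length);
}
```

The only point of the sketch is ordering: the flush comes after the last CPU write to the buffer and before the URB would be handed to the host controller.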
The drawback to this fix is that it's in one high level driver (ALSA USB audio) instead of inside UHCI where it really belongs. Unfortunately, the URB abstraction gives us no reliable way to do it in the HCD, because other drivers don't necessarily pass a valid CPU-side buffer address when using URB_NO_TRANSFER_DMA_MAP.
Should I whip up some patches for usb audio for 2.6.x and 3.3.x and leave it at that, or should we discuss a bit more 'why isn't the kernel doing this for us?'
Monty
Monty Montgomery wrote:
> UHCI overlaps sending on the USB bus with DMAing the data from memory. It does not have all the data it needs to transmit a complete packet before it begins doing so. If the data it needs is in a dirty cache line on a CPU, the flush triggers after DMA is already started; the UHCI host controller has to wait for the flush before accessing it,
I've heard that the latest Intel architecture allows DMA to and from(?) the L3 cache.
> and if the flush happens too late, UHCI aborts the packet. This is happening regularly on all the UHCI controllers I've tried thus far.
> My guess is that this issue never attracted much attention because a) everyone knows isoch transfers are 'unreliable'
FireWire controllers show how this is done: read entire packets into a FIFO, and do this several frames before sending. (Whether that works in practice depends on how big the FIFO is, and if the FIFO is shared with asynchronous packets to be transmitted.)
I've never heard of a USB controller being advertized as having a big FIFO.
The UHCI spec says:
| The Host Controller processes the schedule one entry at a time (this
| discussion does not preclude prefetching of schedule entries).
but then:
| The Host Controller fetches the next entry from the Frame List when
| the millisecond allotted to the current frame expires.
... so it is not actually possible to prefetch packets in a useful way.
The EHCI spec does not mention prefetching; the v1.1 addendum ("Energy-efficient extensions") defines a mechanism to allow prefetching, but this is not supported by Linux.
> The fix for usbaudio is dead easy. Call:
> clflush_cache_range(urb->transfer_buffer, urb->transfer_buffer_length);
... if you happen to be on an x86 architecture ...
> as the last step of prepare_playback_urb() and prepare_nodata_playback_urb(). This completely corrects the DMA starvation behavior on my machines here.
> Note that the clflush_cache_range() has nothing to do with maintaining coherency.
And this is the problem with portability; all the portable functions in <asm/cacheflush.h> are NOPs on cache-coherent architectures.
It might be possible to define a new function like flush_coherent_dcache_range_for_faster_dma(). Maybe this should be called from usb_hcd_map_urb_for_dma() or from the arch implementation of dma_map_*() for DMA_TO_DEVICE. For the latter, this might be implemented as a new DMA attribute.
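As a rough userspace sketch of that idea (not kernel code): flush_coherent_dcache_range_for_faster_dma() is only the name proposed above and does not exist, and mock_map_for_dma() and enum mock_dma_dir are hypothetical stand-ins for the mapping path and the DMA direction. The flush body is a counting no-op here; an architecture that benefits would put its write-back there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's DMA direction enum. */
enum mock_dma_dir { MOCK_DMA_TO_DEVICE, MOCK_DMA_FROM_DEVICE };

static size_t preflush_calls;  /* instrumentation for this sketch only */
static size_t preflush_bytes;

/* Proposed helper (does not exist): a performance hint, not a coherency
 * operation. An arch where it helps would write back the range here;
 * everywhere else it would stay a no-op. */
static void flush_coherent_dcache_range_for_faster_dma(void *vaddr, size_t len)
{
    (void)vaddr;
    preflush_calls++;
    preflush_bytes += len;
}

/* Mock mapping step: pre-flush only for data going out to the device,
 * and only when a CPU-side address is actually available.
 * Returns 1 if a pre-flush was issued, 0 otherwise. */
static int mock_map_for_dma(void *cpu_addr, size_t len, enum mock_dma_dir dir)
{
    if (dir == MOCK_DMA_TO_DEVICE && cpu_addr != NULL && len > 0) {
        flush_coherent_dcache_range_for_faster_dma(cpu_addr, len);
        return 1;
    }
    return 0;
}
```

Doing the check in the mapping path would keep the hint out of individual drivers; capture buffers (device-to-memory) are skipped because the device writes them.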
> Unfortunately, the URB abstraction gives us no reliable way to do it in the HCD, because other drivers don't necessarily pass a valid CPU-side buffer address when using URB_NO_TRANSFER_DMA_MAP.
But the USB API requires this (any HCD might use PIO or use its own double-buffering scheme).
> Should I whip up some patches for usb audio for 2.6.x and 3.3.x
It easily fixes an actual problem, so yes, it's definitely stable-worthy.
Regards, Clemens
Responding to the rest inline, but unfortunately the point is moot; after two days of running well in production and believing it fixed, one of my boxes is once again starving UHCI DMA even with the patch. So I did not in fact fix it.
> I've heard that the latest Intel architecture allows DMA to and from(?) the L3 cache.
FTR, the machines I'm testing on are Core Duo, Core2 Duo and first-gen Nehalem.
> | The Host Controller fetches the next entry from the Frame List when
> | the millisecond allotted to the current frame expires.
> ... so it is not actually possible to prefetch packets in a useful way.
That was my conclusion as well. I could find no hooks to do so. The spec also says:
| Host Controller fetches and decodes the Transfer Descriptor. Assuming a Host-to-Function transaction, the
| Host Controller delays committing to the USB Transaction until the FIFO fills to an appropriate “trigger point”.
| When this threshold is reached, the Host Controller can then begin issuing the Transaction Token.
>> Unfortunately, the URB abstraction gives us no reliable way to do it in the HCD, because other drivers don't necessarily pass a valid CPU-side buffer address when using URB_NO_TRANSFER_DMA_MAP.
> But the USB API requires this (any HCD might use PIO or use its own double-buffering scheme).
Oh, it does? I hadn't been able to come to that conclusion from the docs I read. If so, then that's good.
Anyway, going to keep trying.
Monty