[PATCH 1/3] ALSA USB MIDI: Fix port starvation
In case of a multiport USB MIDI interface the lower numbered ports starve higher numbered ports when more than one port is busy transmitting data. The starvation as of the current code is complete and can be for an arbitrarily long time. Control from userspace is not possible without at least halving the theoretically possible device throughput. This is especially bad as there are now 16x16 interface products available.
An unpredicable and arbitrarily long latency is not acceptable. The loss of half of the throughput is not acceptable, either. Fair scheduling of all busy ports is required.
The patch below balances the actual USB output between all busy ports. It is done in such a way that the single port use performance before applying the patch is identical to the one after applying the patch. To get there, the snd_usbmidi_transmit_byte helper had to be converted to return output notification to allow for the 'repeat' shortcut.
The patch tries to avoid O(2) load increase by scaling the balancing loop to the ports that are actually busy as far as this is possible.
For the patch to properly work the wMaxPacketSize of the device must be large enough to allow for at least one MIDI event per port in a URB. If this constraint is not met, which is quite improbable, starvation will occur again. Though this could be fixed the likelyhood of such a device is so low that the additional overhead introduced for all other devices is not worth it. Current multi port MIDI interfaces do typically have 2^n output ports and 2^x as wMaxPacketSize where x>n.
For a four port USB MIDI interface with a wMaxPacketSize of 64 the maximum latency for any port is changed by the patch from indefinite to 107.52ms.
The patch affects the output of all class compliant USB MIDI interfaces. Users will typically experience either no change or increased response, depending on their use case.
Signed-off-by: Andreas Steinmetz ast@domdv.de
--- a/sound/usb/midi.c 2020-03-12 22:45:06.611877152 +0100 +++ b/sound/usb/midi.c 2020-03-13 02:33:21.392847930 +0100 @@ -554,7 +554,7 @@ static void snd_usbmidi_output_midiman_p /* * Converts MIDI commands to USB MIDI packets. */ -static void snd_usbmidi_transmit_byte(struct usbmidi_out_port *port, +static int snd_usbmidi_transmit_byte(struct usbmidi_out_port *port, uint8_t b, struct urb *urb) { uint8_t p0 = port->cable; @@ -563,6 +563,7 @@ static void snd_usbmidi_transmit_byte(st
if (b >= 0xf8) { output_packet(urb, p0 | 0x0f, b, 0, 0); + return 1; } else if (b >= 0xf0) { switch (b) { case 0xf0: @@ -585,20 +586,20 @@ static void snd_usbmidi_transmit_byte(st case 0xf6: output_packet(urb, p0 | 0x05, 0xf6, 0, 0); port->state = STATE_UNKNOWN; - break; + return 1; case 0xf7: switch (port->state) { case STATE_SYSEX_0: output_packet(urb, p0 | 0x05, 0xf7, 0, 0); - break; + return 1; case STATE_SYSEX_1: output_packet(urb, p0 | 0x06, port->data[0], 0xf7, 0); - break; + return 1; case STATE_SYSEX_2: output_packet(urb, p0 | 0x07, port->data[0], port->data[1], 0xf7); - break; + return 1; } port->state = STATE_UNKNOWN; break; @@ -619,7 +620,7 @@ static void snd_usbmidi_transmit_byte(st port->state = STATE_UNKNOWN; } output_packet(urb, p0, port->data[0], b, 0); - break; + return 1; case STATE_2PARAM_1: port->data[1] = b; port->state = STATE_2PARAM_2; @@ -633,7 +634,7 @@ static void snd_usbmidi_transmit_byte(st port->state = STATE_UNKNOWN; } output_packet(urb, p0, port->data[0], port->data[1], b); - break; + return 1; case STATE_SYSEX_0: port->data[0] = b; port->state = STATE_SYSEX_1; @@ -646,29 +647,49 @@ static void snd_usbmidi_transmit_byte(st output_packet(urb, p0 | 0x04, port->data[0], port->data[1], b); port->state = STATE_SYSEX_0; - break; + return 1; } } + return 0; }
static void snd_usbmidi_standard_output(struct snd_usb_midi_out_endpoint *ep, struct urb *urb) { int p; + int start = 0; + int total = 0x10; + int max = ep->max_transfer - 4; + + if (max < 0) + return; + + while (1) { + int first = -1; + int final = -1;
- /* FIXME: lower-numbered ports can starve higher-numbered ports */ - for (p = 0; p < 0x10; ++p) { - struct usbmidi_out_port *port = &ep->ports[p]; - if (!port->active) - continue; - while (urb->transfer_buffer_length + 3 < ep->max_transfer) { + for (p = start; p < total; ++p) { + struct usbmidi_out_port *port = &ep->ports[p]; uint8_t b; + if (!port->active) + continue; +repeat: if (snd_rawmidi_transmit(port->substream, &b, 1) != 1) { port->active = 0; - break; + continue; } - snd_usbmidi_transmit_byte(port, b, urb); + if (!snd_usbmidi_transmit_byte(port, b, urb)) + goto repeat; + if (urb->transfer_buffer_length > max) + return; + if (first == -1) + first = p; + final = p; } + if (final == -1) + return; + start = first; + total = final + 1; } }
Andreas Steinmetz wrote:
the snd_usbmidi_transmit_byte helper had to be converted to return output notification to allow for the 'repeat' shortcut.
Why not simply handle one MIDI byte per port in each iteration? It could be argued that single-byte MIDI commands are likely to be real- time messages and deserve to go first.
Current multi port MIDI interfaces do typically have 2^n output ports and 2^x as wMaxPacketSize where x>n.
The USB specification requires bulk endpoints to have a wMaxPacketSize value of 8/16/32/64 for full speed, or exactly 512 for high speed.
For the patch to properly work the wMaxPacketSize of the device must be large enough to allow for at least one MIDI event per port in a URB.
There are devices that handle only the first four bytes of a received packet (because Windows used to send only small packets), and one of them, the ESI M4U, actually has more than one port.
My original idea for that FIXME was to use round robin until the packet is filled (or all ports are empty), and to store the next port index (where to start for the next packet) in the endpoint. This would be able to distribute the balancing over multiple packets.
Regards, Clemens
On Mon, 2020-03-16 at 13:03 +0100, Clemens Ladisch wrote:
Andreas Steinmetz wrote:
the snd_usbmidi_transmit_byte helper had to be converted to return output notification to allow for the 'repeat' shortcut.
Why not simply handle one MIDI byte per port in each iteration? It could be argued that single-byte MIDI commands are likely to be real- time messages and deserve to go first.
Actually the patch does exactly this. As soon as the helper signals that a message is complete the next port is processed. The "repeat" loop is just necessary to get a complete class compliant message for transfer as the helper processes one byte at a time. The range optimization is there to prevent O(2) performance cost if possible.
Current multi port MIDI interfaces do typically have 2^n output ports and 2^x as wMaxPacketSize where x>n.
The USB specification requires bulk endpoints to have a wMaxPacketSize value of 8/16/32/64 for full speed, or exactly 512 for high speed.
For the patch to properly work the wMaxPacketSize of the device must be large enough to allow for at least one MIDI event per port in a URB.
There are devices that handle only the first four bytes of a received packet (because Windows used to send only small packets), and one of them, the ESI M4U, actually has more than one port.
Again no problem as the patch is designed to prefer quirks over user settings to prevent malfunctioning devices. Quirk processing and thus size restrictions are processed after user selection.
My original idea for that FIXME was to use round robin until the packet is filled (or all ports are empty), and to store the next port index (where to start for the next packet) in the endpoint. This would be able to distribute the balancing over multiple packets.
The problem with "round robin until the packet is filled" is that in case of a large wMaxPacketSize there's then a huge latency. I just got for further internal testing an USB3 8x8 interface that uses a wMaxPacketSize of 512. In such a case a port doing a sysex transfer will delay another port then sending a realtime message by a minimum of 123ms per queued URB. Make that 7 URBs and you have a sysex queue of 860ms until the realtime message can be transmitted.
Regards, Clemens
Andreas Steinmetz wrote:
On Mon, 2020-03-16 at 13:03 +0100, Clemens Ladisch wrote:
Andreas Steinmetz wrote:
the snd_usbmidi_transmit_byte helper had to be converted to return output notification to allow for the 'repeat' shortcut.
Why not simply handle one MIDI byte per port in each iteration? It could be argued that single-byte MIDI commands are likely to be real- time messages and deserve to go first.
Actually the patch does exactly this. As soon as the helper signals that a message is complete the next port is processed. The "repeat" loop is just necessary to get a complete class compliant message for transfer as the helper processes one byte at a time.
Why is it necessary to immedately get a complete class-compliant message?
The range optimization is there to prevent O(2) performance cost if possible.
Do you mean O(n^2)? For what n?
My original idea for that FIXME was to use round robin until the packet is filled (or all ports are empty), and to store the next port index (where to start for the next packet) in the endpoint. This would be able to distribute the balancing over multiple packets.
The problem with "round robin until the packet is filled" is that in case of a large wMaxPacketSize there's then a huge latency.
This mechanism does not require using the original packet size. The point is that we should not restart with the first port in each packet.
Regards, Clemens
On Tue, 2020-03-17 at 11:26 +0100, Clemens Ladisch wrote:
Andreas Steinmetz wrote:
On Mon, 2020-03-16 at 13:03 +0100, Clemens Ladisch wrote:
Andreas Steinmetz wrote:
the snd_usbmidi_transmit_byte helper had to be converted to return output notification to allow for the 'repeat' shortcut.
Why not simply handle one MIDI byte per port in each iteration? It could be argued that single-byte MIDI commands are likely to be real- time messages and deserve to go first.
Actually the patch does exactly this. As soon as the helper signals that a message is complete the next port is processed. The "repeat" loop is just necessary to get a complete class compliant message for transfer as the helper processes one byte at a time.
Why is it necessary to immedately get a complete class-compliant message?
Maybe that's a misunderstanding caused by me using 'message' instead of 'event'. We need a complete MIDI event (or up to 3 sysex bytes) to form the class compliant 4 byte USB event. That's what the 'repeat' shortcut is for. And after one such event is collected to the URB the code switches to the next port.
The range optimization is there to prevent O(2) performance cost if possible.
Do you mean O(n^2)? For what n?
What I do mean is that in case of only one port being active we would do 16 full outer iterations over all ports to collect 1 event. And we need to do this for all events of that port. Even if we use the hint from port->active as a shortcut its always the full set of iterations.
Memorizing the first and the last port of the previous loop and then restricting to this set makes a full 16 port run in the first iteration and then only the active port for all subsequent iterations.
And yes, in the worst case of a 16x16 interface with ports 1 and 16 being active we're back to processing 16 ports for every iteration - I do, however, see this scenario as being quite theoretical in practice.
My original idea for that FIXME was to use round robin until the packet is filled (or all ports are empty), and to store the next port index (where to start for the next packet) in the endpoint. This would be able to distribute the balancing over multiple packets.
The problem with "round robin until the packet is filled" is that in case of a large wMaxPacketSize there's then a huge latency.
This mechanism does not require using the original packet size. The point is that we should not restart with the first port in each packet.
I was thinking about that but in practice I do not know any class compliant interface that does have a number of outputs that is not equal to 2^n, i.e. 1, 2, 4, 8, 16.
Then wMaxPacketSize is also always a 2^m value (16, 64, 512).
With m>=n+2 (I don't believe that there is any class compliant interface for which this is not true) we can always start at the first port while doing fair scheduling (when a new URB is filled all ports get an equal change if they do have events pending).
Implementing a restart based on the last packet scheduled for the very rare possibility of the above being not true just adds unnecessary complexity.
Regards, Clemens
Regards
Andreas Steinmetz wrote:
On Tue, 2020-03-17 at 11:26 +0100, Clemens Ladisch wrote:
Why is it necessary to immedately get a complete class-compliant message?
Maybe that's a misunderstanding caused by me using 'message' instead of 'event'. We need a complete MIDI event (or up to 3 sysex bytes) to form the class compliant 4 byte USB event. That's what the 'repeat' shortcut is for. And after one such event is collected to the URB the code switches to the next port.
Why is it necessary to get a complete USB event before switching to the next port? Why not just process one byte in each loop iteration?
The point is that we should not restart with the first port in each packet.
I was thinking about that but in practice I do not know any class compliant interface that does have a number of outputs that is not equal to 2^n, i.e. 1, 2, 4, 8, 16.
Then wMaxPacketSize is also always a 2^m value (16, 64, 512).
With m>=n+2 (I don't believe that there is any class compliant interface for which this is not true)
At least ESI M4U has n=2, m=2. It's for devices like this that the driver uses multiple packets. (I do not know if this (revision of the) device is still used.)
It's certainly true that using the full packet size and 7 packets is excessive in most situations. The driver should _automatically_ reduce these values when it is safe. (I'm not saying that module parameters are completely superfluous, but they should never be necessary if the driver can avoid it.)
I think the lowest-hanging fruit are the hardcoded 0x10, and that port->active is not a single bitfield; the latter will make searching for active ports easier. I'll create patches for this.
Regards, Clemens
On Wed, 2020-03-18 at 10:35 +0100, Clemens Ladisch wrote:
Andreas Steinmetz wrote:
On Tue, 2020-03-17 at 11:26 +0100, Clemens Ladisch wrote:
Why is it necessary to immedately get a complete class-compliant message?
Maybe that's a misunderstanding caused by me using 'message' instead of 'event'. We need a complete MIDI event (or up to 3 sysex bytes) to form the class compliant 4 byte USB event. That's what the 'repeat' shortcut is for. And after one such event is collected to the URB the code switches to the next port.
Why is it necessary to get a complete USB event before switching to the next port? Why not just process one byte in each loop iteration?
Because of fairness. Let's assume a keyboard sends 6 note on messages, that's 6 3 byte events on a port, while another port just sends channel pressure which is two bytes and with running status only one byte. Instead of fair scheduling the new notes will be delayed by the channel pressure when doing only one byte per iteration. So the grouped together note on events would be torn apart further than necessary which may be audible.
The point is that we should not restart with the first port in each packet.
I was thinking about that but in practice I do not know any class compliant interface that does have a number of outputs that is not equal to 2^n, i.e. 1, 2, 4, 8, 16.
Then wMaxPacketSize is also always a 2^m value (16, 64, 512).
With m>=n+2 (I don't believe that there is any class compliant interface for which this is not true)
At least ESI M4U has n=2, m=2. It's for devices like this that the driver uses multiple packets. (I do not know if this (revision of the) device is still used.)
Well, in this rare case the patch wouldn't do any harm. The only thing is that there would be no improvement in behaviour to the previous code version for this special device.
Please note that I did take especially care to not increase cpu usage (on x86_64) because I wanted to make sure that (hopefully) there is no load increase for slow embedded systems.
It's certainly true that using the full packet size and 7 packets is excessive in most situations. The driver should _automatically_ reduce these values when it is safe. (I'm not saying that module parameters are completely superfluous, but they should never be necessary if the driver can avoid it.)
If automatic tuning is possible, fine. Manual override should still be possible, though, for the case the user knows better than an automatism for the use case of the user.
I think the lowest-hanging fruit are the hardcoded 0x10, and that port->active is not a single bitfield; the latter will make searching for active ports easier. I'll create patches for this.
I'm very interested.
Cheers
On Wed, 2020-03-18 at 10:35 +0100, Clemens Ladisch wrote:
I think the lowest-hanging fruit are the hardcoded 0x10, and that port->active is not a single bitfield; the latter will make searching for active ports easier. I'll create patches for this.
I've created a small test utility that may be useful to test patches. See https://github.com/not1337/rawmidiperf
BTW: Have you you already have something to look at?
Regards Andreas
participants (2)
-
Andreas Steinmetz
-
Clemens Ladisch