Hi Russell,
On Monday 04 August 2014 19:32:25 Russell King - ARM Linux wrote:
On Mon, Aug 04, 2014 at 08:03:45PM +0200, Lars-Peter Clausen wrote:
If the hardware has scatter-gather support it allows the driver to chain the descriptors before submitting them, which reduces the latency between the transfers as well as the I/O overhead.
While partially true, that's not the full story...
BTW, you're talking about stuff in DMA engine not being clear, but you're using confusing terminology. Descriptors vs transactions. The prepare functions return a transaction. Descriptors are the hardware data structures which describe the transaction. I'll take what you're talking about above as "chain the previous transaction descriptors to the next transaction descriptors".
Well, the prep_* functions return a struct dma_async_tx_descriptor, documented as an "async transaction descriptor".
There are several types of descriptors, transactions and transfers involved, with different names depending on where you look at.
- At the highest level, we have the DMA engine representation of a transaction in the form of a struct dma_async_tx_descriptor (even this is slightly misleading, as tx is a common abbreviation of transmit or transmission, but not of transaction).
- One level lower, when the high level transaction targets non-contiguous memory (from the device point of view) the transaction is split into contiguous chunks. The device might be able to execute a list (or table, depending on the implementation) of chunks on its own without requiring software intervention. If it isn't, the driver will need to submit the next chunk in the completion interrupt of the previous chunk. Even when the device supports executing multiple chunks on its own, it might be limited in the number of chunks it can chain, requiring software intervention to handle one transaction descriptor.
- At the lowest level, the hardware will perform the transfer by repeating transfer cycles, reading a data unit from the source and writing to the destination. When the source or destination supports it, the read and/or write operations can also be grouped in bursts.
If we want to lower the confusion we should decide on names for those different levels and stick to them.
The highest level unit is called a transaction by (at least some parts of) the API, the name sounds good enough to me. "Transaction" could thus refer to that operation, and "transaction descriptor" to the struct dma_async_tx_descriptor instance.
We could then say that a transaction is split into transfers, each of them targeting a piece of memory that is contiguous from the point of view of both the source and the destination, and that transfers are split into transfer cycles, each of them transferring one data unit or element. I'm also open to other proposals (including using the name "chunk" for one of the elements).
The flaw with the current implementation is that there is only one global chain per channel instead of e.g. having the possibility to build up a chain in a driver and then submit and start the chain.
Well, that's not completely true, the API supports scatterlists, so you could create a single transaction descriptor that spans several unrelated transfers (as long as they can use the same channel, for instance targeting the same device for slave transactions).
Some drivers have virtual channels where each channel basically acts as the chain, and once issue_pending is called the chain is mapped to a real channel which then executes it.
Most DMA engines are unable to program anything except the parameters for the next stage of the transfer. In order to switch between "channels", many DMA engine implementations need the help of the CPU to reprogram the physical channel configuration. Chaining two different channels which may ultimately end up on the same physical channel would be a bug in that case.
I'm mostly familiar with DMA engines designed for slave transfers. The ones I've seen have channels that are programmed and run independently, usually with some logic to arbitrate bus access. When they support executing lists or arrays of transfers the hardware transfer descriptors include the source and destination addresses and the number of elements to be transferred. The identifier of the slave device (basically the DMA request line to which the slave is connected) is constant across all chained transfers.
I'm not sure what you mean by "switching between channels". Could you please explain that?
Where the real flaw exists is the way that a lot of people write their DMA engine drivers - in particular how they deal with the end of a transfer.
Many driver implementations receive an interrupt from the DMA controller, and either queue a tasklet, or they check the existing transfer, mark it as completed in some way, and queue a tasklet.
When the tasklet runs, they then look to see if there's another transfer which they can start, and they then start it.
That is horribly inefficient - it is much better to do all the DMA manipulation in IRQ context. So, when the channel completes the existing transfer, you move the transaction to the queue of completed transfers and queue the tasklet, then check whether there's a transaction pending for the same channel, and if so, start it immediately.
This means that your inter-transfer gap is reduced down from the interrupt latency plus tasklet latency, to just the interrupt latency.
I totally agree. This should be documented to avoid this kind of mistake in the future. Maxime, if you can find time for it, could you add this to the next version of your documentation patch ?
Controllers such as OMAP (if their hardware scatter chains were used) do have the ability to reprogram the entire channel configuration from an appropriate transaction, and so /could/ start the next transfer entirely automatically - but I never added support for the hardware scatterlists as I have been told that TI measurements indicated that using them did not gain any performance. Had this been implemented, it would mean that OMAP would only need to issue an interrupt to notify completion of a transfer (so the driver would only have to work out how many DMA transactions had been completed).
In this case, it is important that we do batch up the entries (since an already in progress descriptor should not be modified), but I suspect in the case of slave DMA, it is rarely the case that there is more than one or two descriptors queued at any moment.
I agree, in most cases there's only one or a few transaction descriptors queued for slave DMA. There could then be a larger number of hardware transfer descriptors to represent one transaction descriptor, but those would have been created by the DMA engine driver from a single transaction descriptor, so there would be no problem chaining the transfers.
How about memcpy (non-slave) DMA? Do client drivers submit lots of small DMA transactions that should be chained for optimal performance?