Re: [alsa-devel] Query on Audio DMA using DMAEngine
Added alsa-devel to Cc.
On 06/28/2013 05:27 AM, Fernandes, Joel wrote:
Hi Lars,
Hope you are doing well.
I am implementing Cyclic DMA support in the EDMA driver that is used by Davinci and now newer TI SoCs. I am thinking once I am done I can plug it into the snd_dmaengine framework.
Currently however, the davinci-pcm code directly programs the EDMA. That is what I am working to replace with a single driver and adapt to the snd dmaengine framework. However, the current code in davinci-pcm uses internal RAM as an intermediate step in the whole DMA process (data is first transferred from DRAM to IRAM, and then from IRAM to the audio device).
Do you have any ideas on how we can adapt to the framework, such that we can still use the IRAM? Are there any existing implementations out there that do something similar?
Hm, I guess using the snd_dmaengine_pcm helper functions here shouldn't be too hard. Using the generic snd_dmaengine_pcm driver will require some extensions to it though. The mmp platform (pxa/mmp-pcm.c) is also using some kind of on-chip memory, so having support for this in the generic driver certainly makes sense. For the chaining you'd probably have to extend the dmaengine framework, since this kind of interleaved mem-to-mem and mem-to-dev cyclic transfer is currently not possible.
I'm wondering though why do you need to copy the data to RAM first, is it not possible to map the IRAM to userspace?
- Lars
On 06/30/2013 02:06 PM, Lars-Peter Clausen wrote:
Added alsa-devel to Cc.
On 06/28/2013 05:27 AM, Fernandes, Joel wrote:
Hi Lars,
Hope you are doing well.
I am implementing Cyclic DMA support in the EDMA driver that is used by Davinci and now newer TI SoCs. I am thinking once I am done I can plug it into the snd_dmaengine framework.
Currently however, the davinci-pcm code directly programs the EDMA. That is what I am working to replace with a single driver and adapt to the snd dmaengine framework. However, the current code in davinci-pcm uses internal RAM as an intermediate step in the whole DMA process (data is first transferred from DRAM to IRAM, and then from IRAM to the audio device).
Do you have any ideas on how we can adapt to the framework, such that we can still use the IRAM? Are there any existing implementations out there that do something similar?
Hm, I guess using the snd_dmaengine_pcm helper functions here shouldn't be too hard. Using the generic snd_dmaengine_pcm driver will require some extensions to it though. The mmp platform (pxa/mmp-pcm.c) is also using some kind of on-chip memory, so having support for this in the generic driver certainly makes sense. For the chaining you'd probably have to extend the dmaengine framework, since this kind of interleaved mem-to-mem and mem-to-dev cyclic transfer is currently not possible.
I'm wondering though why do you need to copy the data to RAM first, is it not possible to map the IRAM to userspace?
I've already built a cyclic DMA implementation into the EDMA driver for Davinci, without using the internal RAM. But that was for a 2.6.37 kernel.
For capture, the internal RAM ping pong only made things worse, not better. I really have no idea what problem it was supposed to solve.
The trouble with the current davinci driver is that the IRQ handler has a real-time requirement, it must finish before the next DMA block completes. This causes most of the buffer overruns on heavily loaded systems. It's easy to set up a cyclic chain of DMA transfers with the EDMA controller that continuously transfers data to the audio buffer. Once that is done, the completion IRQ can be used to periodically "trigger" user space, but it isn't time critical any more. The McASP has enough internal buffering to take care of any DDR latency issues.
With the cyclic DMA, I can capture 16 channels of 32-bit audio at 51kHz, simultaneously playback 2 channels and write the audio data to an SD card on the OMAP-L138. Before that change, it wasn't even possible to capture 4 channels without overruns.
I can mail you the 2.6.37 code, it isn't worthy for direct inclusion but may save you some time to figure things out.
Kind regards, Mike.
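To put the figures above in perspective, here is a minimal arithmetic sketch of the capture data rate. The channel count, sample width and rate are the ones quoted in the message; the 128 KiB buffer size used in the usage note is only an assumption for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

/* Sustained capture data rate in bytes per second. */
static uint64_t capture_bytes_per_sec(uint32_t channels,
                                      uint32_t bytes_per_sample,
                                      uint32_t rate_hz)
{
    return (uint64_t)channels * bytes_per_sample * rate_hz;
}

/* Time in microseconds until a buffer of the given size is full
 * (integer division, rounds down). bytes_per_sec must be non-zero. */
static uint64_t buffer_fill_time_us(uint64_t buffer_bytes,
                                    uint64_t bytes_per_sec)
{
    return buffer_bytes * 1000000u / bytes_per_sec;
}
```

With 16 channels of 4-byte samples at 51000 Hz, capture_bytes_per_sec() gives 3264000 bytes per second, and an assumed 128 KiB (131072 byte) buffer would fill in roughly 40 ms.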
On 07/01/2013 01:10 AM, Mike Looijmans wrote:
On 06/30/2013 02:06 PM, Lars-Peter Clausen wrote:
Added alsa-devel to Cc.
On 06/28/2013 05:27 AM, Fernandes, Joel wrote:
Hi Lars,
Hope you are doing well.
I am implementing Cyclic DMA support in the EDMA driver that is used by Davinci and now newer TI SoCs. I am thinking once I am done I can plug it into the snd_dmaengine framework.
Currently however, the davinci-pcm code directly programs the EDMA. That is what I am working to replace with a single driver and adapt to the snd dmaengine framework. However, the current code in davinci-pcm uses internal RAM as an intermediate step in the whole DMA process (data is first transferred from DRAM to IRAM, and then from IRAM to the audio device).
Do you have any ideas on how we can adapt to the framework, such that we can still use the IRAM? Are there any existing implementations out there that do something similar?
Hm, I guess using the snd_dmaengine_pcm helper functions here shouldn't be too hard. Using the generic snd_dmaengine_pcm driver will require some extensions to it though. The mmp platform (pxa/mmp-pcm.c) is also using some kind of on-chip memory, so having support for this in the generic driver certainly makes sense. For the chaining you'd probably have to extend the dmaengine framework, since this kind of interleaved mem-to-mem and mem-to-dev cyclic transfer is currently not possible.
I'm wondering though why do you need to copy the data to RAM first, is it not possible to map the IRAM to userspace?
I've already built a cyclic DMA implementation into the EDMA driver for Davinci, without using the internal RAM. But that was for a 2.6.37 kernel.
Great!
For capture, the internal RAM ping pong only made things worse, not better. I really have no idea what problem it was supposed to solve.
Interesting.
The trouble with the current davinci driver is that the IRQ handler has a real-time requirement, it must finish before the next DMA block completes. This causes most of the buffer overruns on heavily loaded systems.
But how do you get around not calling snd_pcm_period_elapsed in a time-sensitive fashion? Isn't it always time sensitive, or maybe you mean the timing is a bit more relaxed (still sensitive though), as now the interrupt handler can take its own time to finish as long as it finishes before the next interrupt comes.
If that's what you mean, then actually what you said is not true for the ping-pong implementation, because the DMA controller is programmed only *once* at the beginning for the ping-pong or IRAM case. It is just the way the ping-pong works, there is no need to program the DMA controller again and again every interrupt. On the other hand, I fully agree that for the regular case the DMA controller has to be programmed for every period, and this is what I guess makes it time sensitive; you could confirm.
It's easy to set up a cyclic chain of DMA transfers with the EDMA controller that continuously transfers data to the audio buffer. Once that is done, the completion IRQ can be used to periodically "trigger" user space, but it isn't time critical any more.
That makes a lot of sense.
The McASP has enough internal buffering to take care of any DDR latency issues.
Sure.
With the cyclic DMA, I can capture 16 channels of 32-bit audio at 51kHz, simultaneously playback 2 channels and write the audio data to an SD card on the OMAP-L138. Before that change, it wasn't even possible to capture 4 channels without overruns.
Sweet! Any particular reason why it wasn't merged in vs the existing ping-pong code?
I can mail you the 2.6.37 code, it isn't worthy for direct inclusion but may save you some time to figure things out.
Certainly could take a look. Could you share it? Thank you.
Thanks,
-Joel
On 07/02/2013 03:28 AM, Joel Fernandes wrote:
On 07/01/2013 01:10 AM, Mike Looijmans wrote:
With the cyclic DMA, I can capture 16 channels of 32-bit audio at 51kHz, simultaneously playback 2 channels and write the audio data to an SD card on the OMAP-L138. Before that change, it wasn't even possible to capture 4 channels without overruns.
Sweet! Any particular reason why it wasn't merged in vs the existing ping-pong code?
I've posted questions and other stuff concerning the McASP/OMAP1, but there was very little interest, so I supposed the chipset was on its way out and there wasn't any point in maintaining it.
I can mail you the 2.6.37 code, it isn't worthy for direct inclusion but may save you some time to figure things out.
Certainly could take a look. Could you share it? Thank you.
I attached the source files.
There are a lot of changes in the files that are product specific hacks to get my things working in the cheapest way possible (cheap meaning spending little effort).
I also included the "davinci-mcasp.c" code. Because the customer wanted a bigger buffer, I set up the McASP FIFO to transfer larger blocks. The DMA is limited to 64k words, so increasing the size of a "word" is a sure way to transfer more data. It's also good for bursting transfers, which I'm told the DDR memory likes much better. I was unable to measure any effect (positive or negative) of that though. The larger blocks aren't needed if you're satisfied with much smaller buffers, which for normal purposes should be just fine.
Mike.
On Tue, Jul 02, 2013 at 08:02:00AM +0200, Mike Looijmans wrote:
On 07/02/2013 03:28 AM, Joel Fernandes wrote:
Sweet! Any particular reason why it wasn't merged in vs the existing ping-pong code?
I've posted questions and other stuff concerning the McASP/OMAP1, but there was very little interest, so I supposed the chipset was on its way out and there wasn't any point in maintaining it.
Did you CC the relevant maintainers and other people working on the code? You've not done so on this thread... if you only post to the list it's very likely that people won't see what you've sent.
On 07/02/2013 02:16 PM, Mark Brown wrote:
On Tue, Jul 02, 2013 at 08:02:00AM +0200, Mike Looijmans wrote:
On 07/02/2013 03:28 AM, Joel Fernandes wrote:
Sweet! Any particular reason why it wasn't merged in vs the existing ping-pong code?
I've posted questions and other stuff concerning the McASP/OMAP1, but there was very little interest, so I supposed the chipset was on its way out and there wasn't any point in maintaining it.
Did you CC the relevant maintainers and other people working on the code? You've not done so on this thread... if you only post to the list it's very likely that people won't see what you've sent.
I'm relatively new to Linux kernel programming. The key problem for me was - and still is - that there is so overwhelmingly much information available, that it's virtually impossible to find out things that would be obvious for long-time developers, like finding out who the maintainer for a piece of code is. I still don't know that, by the way. How do I find the "CC" list that I'm supposed to send bugs/suggestions/patches to for a given piece of code?
I guess that a document on kernel-driver-development-for-people-who-used-to-work-with-a-centrally-organized-OS-and-used-to-get-all-their-answers-from-them would help, but then again finding that particular document - or realizing that it even exists (it doesn't, does it?) - would be the next problem.
It's quite easy to find out how one goes about writing a driver, but the process surrounding it - such as finding whether such a driver already exists, where to go for technical advice and where to post the git patch for inclusion in mainline is something that no one seems to want to dwell on.
Sorry if I'm ranting here. Maybe in time I'll learn to behave...
Mike
On Tue, Jul 02, 2013 at 03:30:51PM +0200, Mike Looijmans wrote:
things that would be obvious for long-time developers, like finding out who the maintainer for a piece of code is. I still don't know that, by the way. How do I find the "CC" list that I'm supposed to send bugs/suggestions/patches to for a given piece of code?
MAINTAINERS and git log should give you a good guide - there's a script called get_maintainer.pl in the kernel which will help but shouldn't be 100% relied on. Basically just look at revision control history and see who's been working on the code.
I guess that a document on kernel-driver-development-for-people-who-used-to-work-with-a-centrally-organized-OS-and-used-to-get-all-their-answers-from-them would help, but then again finding that particular document - or realizing that it even exists (it doesn't, does it?) - would be the next problem.
That's what MAINTAINERS is there for.
It's quite easy to find out how one goes about writing a driver, but the process surrounding it - such as finding whether such a driver already exists, where to go for technical advice and where to post the git patch for inclusion in mainline is something that no one seems to want to dwell on.
Documentation/SubmittingPatches is a pretty good starting point.
Joel Fernandes wrote:
On 07/01/2013 01:10 AM, Mike Looijmans wrote:
The trouble with the current davinci driver is that the IRQ handler has a real-time requirement, it must finish before the next DMA block completes. This causes most of the buffer overruns on heavily loaded systems.
But how do you get around not calling snd_pcm_period_elapsed in a time-sensitive fashion?
To ensure that other stuff is completed first, snd_pcm_period_elapsed() could be called later from a tasklet. (snd_pcm_period_elapsed() calls the .pointer callback, which could be another source of delays depending on how much hardware accesses it does.)
Regards, Clemens
Hi Mike,
On 07/01/2013 01:10 AM, Mike Looijmans wrote: [..]
The trouble with the current davinci driver is that the IRQ handler has a real-time requirement, it must finish before the next DMA block completes. This
I looked into this a little more.
I think you are picturing the following:
DMA transfer -> IRQ has to complete -> DMA transfer -> IRQ has to complete.. etc.
This is not really true in the davinci-pcm driver, the normal case without IRAM works more like..
DMA ----> DMA ----> DMA
   \         \         \
    \__ IRQ   \__ IRQ   \__ IRQ
The only hard requirement is that the IRQ handler must finish updating before the next DMA transfer, or we're in trouble. Is this what you mean by real-time requirement, or did you mean something else?
Either way I'm sure your multi-slot approach is superior, but I don't see how you can get away with not updating the DMA addresses on every IRQ with the current davinci-pcm or EDMA controller (Unless you use a complicated mechanism like ping-pong where the address updates take care of itself). If you are using a set of chained slots, you only have so many slots so you have to continuously change addresses of the slots at some point or the other for a large transfer.
Thanks,
-Joel
On 07/02/2013 05:33 AM, Joel Fernandes wrote:
Hi Mike,
On 07/01/2013 01:10 AM, Mike Looijmans wrote: [..]
The trouble with the current davinci driver is that the IRQ handler has a real-time requirement, it must finish before the next DMA block completes. This
I looked into this a little more.
I think you are picturing the following:
DMA transfer -> IRQ has to complete -> DMA transfer -> IRQ has to complete.. etc.
This is not really true in the davinci-pcm driver, the normal case without IRAM works more like..
DMA ----> DMA ----> DMA
   \         \         \
    \__ IRQ   \__ IRQ   \__ IRQ
The only hard requirement is that the IRQ handler must finish updating before the next DMA transfer, or we're in trouble. Is this what you mean by real-time requirement, or did you mean something else?
Yep, that's what I meant. Because I was capturing 16 channels of 32-bit data, the DMA buffer would drain in mere milliseconds. Interrupt latency on the L138 is pretty bad, I've seen it take over 10ms to handle an IRQ on occasion.
But even with much lower loads, I got underruns when recording to SD card that I couldn't really explain. I noticed that the SD transfers took up a lot of DMA params (about 40), so maybe that was just causing too much work for the IRQ or DMA handler routines.
Either way I'm sure your multi-slot approach is superior, but I don't see how you can get away with not updating the DMA addresses on every IRQ with the current davinci-pcm or EDMA controller (Unless you use a complicated mechanism like ping-pong where the address updates take care of itself). If you are using a set of chained slots, you only have so many slots so you have to continuously change addresses of the slots at some point or the other for a large transfer.
I use a chain like this:
DMA1 -> DMA2 -> DMA... -> DMA1
This meant I had to use a DMA PARAM slot for every "period". The OMAP L138 has 128 of those slots, so it's no problem to use a bunch of them. Because the chain is cyclic, there is no need to update any DMA parameter while running. All that ALSA needs to do is empty the buffer before the cycle completes and the current position gets overwritten.
The IRQ handler is called after each DMA completion, but it's no problem if it isn't handled in time. It is only used to give the ALSA framework a gentle push that a period has been transferred. Completely missing a bunch of interrupts has no effect whatsoever.
Thanks,
You're welcome.
Mike.
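The cyclic chain described above can be pictured with a small userspace sketch. This is not the EDMA driver code and not the real PaRAM register layout; the struct, its fields and the function names are hypothetical stand-ins. It only shows that once each slot links to the next and the last links back to the first, nothing needs to be reprogrammed while the transfer runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PERIODS 8  /* value mentioned in the thread; the rest is hypothetical */

/* Simplified stand-in for one DMA parameter slot (one slot per period). */
struct sim_slot {
    uint32_t dst;    /* start address of the period this slot fills */
    uint32_t bytes;  /* period size in bytes */
    size_t   link;   /* index of the slot that is loaded next */
};

/* Build the ring: slot i links to slot i+1, the last links back to slot 0. */
static void sim_build_ring(struct sim_slot *slots, size_t periods,
                           uint32_t buf_base, uint32_t period_bytes)
{
    assert(periods >= 1 && periods <= MAX_PERIODS);
    for (size_t i = 0; i < periods; i++) {
        slots[i].dst = buf_base + (uint32_t)i * period_bytes;
        slots[i].bytes = period_bytes;
        slots[i].link = (i + 1) % periods;
    }
}

/* Follow the links for a number of completed transfers and return the
 * slot that would run next. No slot is modified along the way. */
static size_t sim_run(const struct sim_slot *slots, size_t start,
                      size_t completions)
{
    size_t cur = start;
    while (completions--)
        cur = slots[cur].link;
    return cur;
}
```

With four periods, six completed transfers starting at slot 0 leave the chain at slot 2, having wrapped around once without any slot being rewritten.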
On Tue, Jul 02, 2013 at 07:50:16AM +0200, Mike Looijmans wrote:
On 07/02/2013 05:33 AM, Joel Fernandes wrote:
But even with much lower loads, I got underruns when recording to SD card that I couldn't really explain. I noticed that the SD transfers took up a lot of DMA params (about 40), so maybe that was just causing too much work for the IRQ or DMA handler routines.
SD cards are generally just slow, it's possible it's just not able to keep up with the data you're throwing at it. Things like batching the writes up into large chunks can help here but you may just be hitting a genuine limit if you need to record for too long and don't have enough fast storage (like RAM) to buffer.
This meant I had to use a DMA PARAM slot for every "period". The OMAP L138 has 128 of those slots, so it's no problem to use a bunch of them. Because the chain is cyclic, there is no need to update any DMA parameter while running. All that ALSA needs to do is empty the buffer before the cycle completes and the current position gets overwritten.
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
On 07/02/2013 02:13 PM, Mark Brown wrote:
On Tue, Jul 02, 2013 at 07:50:16AM +0200, Mike Looijmans wrote:
On 07/02/2013 05:33 AM, Joel Fernandes wrote:
But even with much lower loads, I got underruns when recording to SD card that I couldn't really explain. I noticed that the SD transfers took up a lot of DMA params (about 40), so maybe that was just causing too much work for the IRQ or DMA handler routines.
SD cards are generally just slow, it's possible it's just not able to keep up with the data you're throwing at it. Things like batching the writes up into large chunks can help here but you may just be hitting a genuine limit if you need to record for too long and don't have enough fast storage (like RAM) to buffer.
What I meant by "couldn't really explain" is that I monitored CPU, memory and IO, and could clearly conclude that the card (or network, or even /dev/null on occasion) wasn't the bottleneck in itself. It's not so much the medium, but more the load it causes on the system that triggered the overruns.
This meant I had to use a DMA PARAM slot for every "period". The OMAP L138 has 128 of those slots, so it's no problem to use a bunch of them. Because the chain is cyclic, there is no need to update any DMA parameter while running. All that ALSA needs to do is empty the buffer before the cycle completes and the current position gets overwritten.
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Indeed. In this case, the DMA completion IRQ is still useful because the DMA controller issues it after the "period" data has been transferred completely. Just monitoring the DMA registers will tell which data is currently being transferred, but you can't be sure that it has actually been committed. So if the DMA pointer is now at 0x0100, you cannot be sure whether the data at 0x0100 is already valid, or even the data at 0x0098, because that might still be in the controller's queue. The pointer increments when the request is sent to the queue, but there's no guarantee as to when the queue will actually be executed, because other transactions may have higher priority.
Mike.
On 07/02/2013 02:13 PM, Mark Brown wrote:
On Tue, Jul 02, 2013 at 07:50:16AM +0200, Mike Looijmans wrote:
On 07/02/2013 05:33 AM, Joel Fernandes wrote:
[...]
This meant I had to use a DMA PARAM slot for every "period". The OMAP L138 has 128 of those slots, so it's no problem to use a bunch of them. Because the chain is cyclic, there is no need to update any DMA parameter while running. All that ALSA needs to do is empty the buffer before the cycle completes and the current position gets overwritten.
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Yes, this is usually how it is done. But I'm wondering maybe the EDMA controller only has a small total amount of slots available.
- Lars
On Wed, Jul 03, 2013 at 11:09:22AM +0200, Lars-Peter Clausen wrote:
On 07/02/2013 02:13 PM, Mark Brown wrote:
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Yes, this is usually how it is done. But I'm wondering maybe the EDMA controller only has a small total amount of slots available.
Well, you don't need particularly many slots so long as you can cope with a large period size.
On 07/03/2013 11:43 AM, Mark Brown wrote:
On Wed, Jul 03, 2013 at 11:09:22AM +0200, Lars-Peter Clausen wrote:
On 07/02/2013 02:13 PM, Mark Brown wrote:
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Yes, this is usually how it is done. But I'm wondering maybe the EDMA controller only has a small total amount of slots available.
Well, you don't need particularly many slots so long as you can cope with a large period size.
On the OMAP L138, there are 128 PARAM slots. 32 of those are tied to hardware events (though you can use them if you aren't using the related hardware, for example the UART drivers don't use DMA so you can freely use those slots if you want), leaving (at least) 96 PARAM slots free. Both audio events are on the same controller, so you can't use the 128 of the other one (the OMAP has 2 EDMA controllers). Only a few dozen of those are being used by various drivers, the SD card driver being the most hungry. For the system to work, you can even get away with only using one slot, and hence one period, but then you'll have to use a mmap and a timer to fill it.
I experimented with various memory layouts. For large transfers, using 2 big periods was quite enough. I also tested with very small period sizes. Using the original code, I was unable to reliably capture (to /dev/null) at period sizes below 80 samples. With the cyclic DMA, I could set a period size of only 40 samples and still be able to record audio reliably, when using only 8 periods. The same for playback, basically. So that's how I arrived at the MAX_PERIODS define of "8". It will only claim channels when you use them, so setting it to say "100" will not crash the system.
The period size is limited by the EDMA parameter set. It can only transfer 64k-1 "words" per slot. You can (and should!) use the McASP FIFO buffer to increase the word size, thus allowing for period sizes in the megabyte range.
M.
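As a rough worked example of the period size limit described above: the 64k-1 words-per-slot figure is taken from the message, while the word sizes in the usage note (4 bytes for a single 32-bit sample, 64 bytes for a FIFO burst) are assumptions picked for illustration. The names are hypothetical.

```c
#include <stdint.h>

/* "64k-1 words" per slot, as stated in the message above. */
#define EDMA_MAX_WORDS_PER_SLOT 65535u

/* Largest period, in bytes, that a single slot can transfer when each
 * DMA "word" is word_bytes wide. */
static uint64_t max_period_bytes(uint32_t word_bytes)
{
    return (uint64_t)EDMA_MAX_WORDS_PER_SLOT * word_bytes;
}
```

With 4-byte words this gives 262140 bytes per period; with an assumed 64-byte FIFO burst as the word it gives 4194240 bytes, which is the megabyte range mentioned above.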
Hi Mike,
On 07/03/2013 08:17 AM, Mike Looijmans wrote:
On 07/03/2013 11:43 AM, Mark Brown wrote:
On Wed, Jul 03, 2013 at 11:09:22AM +0200, Lars-Peter Clausen wrote:
On 07/02/2013 02:13 PM, Mark Brown wrote:
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Yes, this is usually how it is done. But I'm wondering maybe the EDMA controller only has a small total amount of slots available.
Well, you don't need particularly many slots so long as you can cope with a large period size.
On the OMAP L138, there are 128 PARAM slots. 32 of those are tied to hardware events (though you can use them if you aren't using the related hardware, for example the UART drivers don't use DMA so you can freely use those slots if you want), leaving (at least) 96 PARAM slots free. Both audio events are on the same controller, so you can't use the 128 of the other one (the OMAP has 2 EDMA controllers). Only a few dozen of those are being used by various drivers, the SD card driver being the most hungry. For the system to work, you can even get away with only using one slot, and hence one period, but then you'll have to use a mmap and a timer to fill it.
I experimented with various memory layouts. For large transfers, using 2 big periods was quite enough. I also tested with very small period sizes. Using the original code, I was unable to reliably capture (to /dev/null) at period sizes below 80 samples. With the cyclic DMA, I could set a period size of only 40 samples and still be able to record audio reliably, when using only 8 periods. The same for playback, basically. So that's how I arrived at the MAX_PERIODS define of "8". It will only claim channels when you use them, so setting it to say "100" will not crash the system.
Wouldn't very small periods take up too many interrupts, and also occupy lots of slots?
Thanks for your post Mike. It makes more sense to me now. Correct me if I'm wrong but:
- The more periods there are, the finer the granularity, but the drawback is that you need more slots and take more interrupts; so we want only as many periods as we need. I still don't know, though, how we arrive at an acceptable number that userspace expects.
- The periods also determine the buffer size. If in future we want to use IRAM as the buffer, which is limited on some users of davinci-pcm, there might not be enough buffer space.
So too many periods is certainly not a good thing. I wonder how we can arrive at what would constitute an acceptable number? As Linus said, "we never break userspace :P" so I'd rather not change anything that breaks someone's audio application.
I will post some RFC notes soon to capture our discussion and other ideas I had put together for EDMA, to summarize and get everyone's opinion. I will copy you on that as well. Thanks.
-Joel
Copying some more lists as we're also discussing the DMA controller in the SoCs. Thanks.
On 07/03/2013 04:43 AM, Mark Brown wrote:
On Wed, Jul 03, 2013 at 11:09:22AM +0200, Lars-Peter Clausen wrote:
On 07/02/2013 02:13 PM, Mark Brown wrote:
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Yes, this is usually how it is done. But I'm wondering maybe the EDMA controller only has a small total amount of slots available.
Well, you don't need particularly many slots so long as you can cope with a large period size.
Hi Mark,
When would it not be possible to cope with a large period size? Are there any guidelines on what to consider when fixing a period size?
I see tegra and au1x go up to .period_bytes_min = 1024
About slots, the following are the numbers of slots on some SoCs with EDMA:
am1808 - 96 slots available + 32 taken up for channels but can be reused with some changes.
am335x - 172 slots available + 64 taken up for channels
On a slightly different note, about buffer_bytes_max, is there any drawback to setting it to a smaller value? Currently 128K is about what is used on davinci-pcm. My idea is to map the buffer directly to IRAM if the IRAM transfers are really what are preventing the underruns, but 128K will be too much for the buffer as we don't have that much IRAM; in fact it is just at the boundary on am33xx (128K).
Thanks,
-Joel
On Wed, Jul 03, 2013 at 12:55:36PM -0500, Joel Fernandes wrote:
When would it not be possible to cope with a large period size? Are there any guidelines on what to consider when fixing a period size?
This is an application issue not a driver issue. An application that wants low latency may need high resolution information about what exactly the hardware is doing.
On 07/03/2013 08:12 PM, Mark Brown wrote:
On Wed, Jul 03, 2013 at 12:55:36PM -0500, Joel Fernandes wrote:
When would it not be possible to cope with a large period size? Are there any guidelines on what to consider when fixing a period size?
This is an application issue not a driver issue. An application that wants low latency may need high resolution information about what exactly the hardware is doing.
To get low-latency, the best thing from userspace is to mmap the audio buffer, and monitor the position of the DMA transfers. If the driver reports the DMA position accurately, you can get latencies of only a few samples. I must admit that I know next to nothing about how ALSA works in userspace, but that's how DirectSound works, for example. And from what I've seen, this is also possible with ALSA.
Even without that - I tried with small periods of only 40 samples, this invariably fails on the current driver, with or without the ping-pong. Using the cyclic DMA I had no problem using such small periods.
Mike.
On Thu, Jul 04, 2013 at 07:56:25AM +0200, Mike Looijmans wrote:
On 07/03/2013 08:12 PM, Mark Brown wrote:
This is an application issue not a driver issue. An application that wants low latency may need high resolution information about what exactly the hardware is doing.
To get low-latency, the best thing from userspace is to mmap the audio buffer, and monitor the position of the DMA transfers. If the driver reports the DMA position accurately, you can get latencies of only a few samples. I must admit that I know next to nothing about how ALSA works in userspace, but that's how DirectSound works, for example. And from what I've seen, this is also possible with ALSA.
There are often hardware limitations that mean that it is not possible to know the actual position of the DMA with anything less than period accuracy - either the hardware just doesn't report the current status during a transfer or it reports something that's not quite what's needed to usefully interact with it. The former is depressingly common. The APIs can support peering at the current position but it's not something that a portable application should be relying on.
Even without that - I tried with small periods of only 40 samples, this invariably fails on the current driver, with or without the ping-pong. Using the cyclic DMA I had no problem using such small periods.
The period size is generally orthogonal to decisions about using cyclic DMA.
On 07/03/2013 07:55 PM, Joel Fernandes wrote:
Copying some more lists as we're also discussing the DMA controller in the SoCs. Thanks.
On 07/03/2013 04:43 AM, Mark Brown wrote:
On Wed, Jul 03, 2013 at 11:09:22AM +0200, Lars-Peter Clausen wrote:
On 07/02/2013 02:13 PM, Mark Brown wrote:
This sort of cyclic thing tends to be best, ideally you don't need interrupts at all (other than a timer).
Yes, this is usually how it is done. But I'm wondering maybe the EDMA controller only has a small total amount of slots available.
Well, you don't need particularly many slots so long as you can cope with a large period size.
Hi Mark,
When would it not be possible to cope with a large period size? Are there any guidelines on what to consider when fixing a period size?
I see tegra and au1x go up to .period_bytes_min = 1024
About slots, the following are the numbers of slots on some SoCs with EDMA:
am1808 - 96 slots available + 32 taken up for channels but can be reused with some changes.
am335x - 172 slots available + 64 taken up for channels
On a slightly different note, about buffer_bytes_max, is there any drawback to setting it to a smaller value? Currently 128K is about what is used on davinci-pcm. My idea is to map the buffer directly to IRAM if the IRAM transfers are really what are preventing the underruns, but 128K will be too much for the buffer as we don't have that much IRAM; in fact it is just at the boundary on am33xx (128K).
In any case, using the IRAM directly might have some use, because you don't have to compete for the DDR RAM with other devices. But I never understood what the ping-pong via IRAM was supposed to accomplish, I don't see why McASP -> IRAM -> DDR RAM (or the other way around) would be better than just McASP -> DDR RAM. Especially since the McASP has a built-in 256 byte FIFO buffer on both channels. In all my measurements, using the IRAM ping-pong only made things worse in terms of overruns and underruns, not better.
Does anyone know why the ping-pong was implemented and what kind of usage it was intended for?
Mike.
On Thu, Jul 04, 2013 at 08:06:34AM +0200, Mike Looijmans wrote:
In any case, using the IRAM directly might have some use, because you don't have to compete for the DDR RAM with other devices. But I never understood what the ping-pong via IRAM was supposed to accomplish, I don't see why McASP -> IRAM -> DDR RAM (or the other way around) would be better than just McASP -> DDR RAM. Especially since the McASP has a built-in 256 byte FIFO buffer on both channels. In all my measurements, using the IRAM ping-pong only made things worse in terms of overruns and underruns, not better.
Does anyone know why the ping-pong was implemented and what kind of usage it was intended for?
Pushing the audio through some static RAM is normally implemented in order to save power - when doing this you can put the dynamic RAM into a lower power state for more of the time, only waking it up to burst data to or from the static RAM (assuming an otherwise idle system). This is more normally used for playback than for capture but the same idea applies in both cases.
On 7/4/2013 11:36 AM, Mike Looijmans wrote:
In any case, using the IRAM directly might have some use, because you don't have to compete for the DDR RAM with other devices. But I never understood what the ping-pong via IRAM was supposed to accomplish, I don't see why McASP -> IRAM -> DDR RAM (or the other way around) would be better than just McASP -> DDR RAM. Especially since the McASP has a built-in 256 byte FIFO buffer on both channels. In all my measurements, using the IRAM ping-pong only made things worse in terms of overruns and underruns, not better.
Does anyone know why the ping-pong was implemented and what kind of usage it was intended for?
The McBSP peripheral included in DaVinci devices such as the DM644x did not come with a FIFO. Due to the latency of DDR accesses, channel swaps caused by lost samples were observed on these devices, and the IRAM implementation helped there.
Thanks, Sekhar
Hi Mike,
On 07/02/2013 12:50 AM, Mike Looijmans wrote: [..]
Either way I'm sure your multi-slot approach is superior, but I don't see how you can get away with not updating the DMA addresses on every IRQ with the current davinci-pcm or EDMA controller (Unless you use a complicated mechanism like ping-pong where the address updates take care of itself). If you are using a set of chained slots, you only have so many slots so you have to continuously change addresses of the slots at some point or the other for a large transfer.
I use a chain like this:
DMA1 -> DMA2 -> DMA... -> DMA1
This meant I had to use a DMA PARAM slot for every "period". The OMAP L138 has 128 of those slots, so it's no problem to use a bunch of them. Because the chain is cyclic, there is no need to update any DMA parameter while running. All that ALSA needs to do is empty the buffer before the cycle completes and the current position gets overwritten.
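The cyclic chain described above can be sketched as a small C model. This is only an illustration: struct dma_slot is a simplified stand-in and not the real EDMA PaRAM layout, build_cyclic_chain() is a hypothetical name, and MAX_SLOTS simply reuses the 128 slots mentioned for the OMAP L138.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 128  /* assumed bound, from the 128 slots quoted for OMAP L138 */

/* Simplified stand-in for a DMA parameter slot (not the real PaRAM layout). */
struct dma_slot {
    uint32_t addr;  /* start address of the period this slot transfers */
    uint32_t len;   /* period length in bytes */
    size_t   link;  /* index of the slot loaded when this one completes */
};

/*
 * Fill `slots` so that slot i covers period i and links to slot i + 1,
 * with the last slot linking back to slot 0. Returns the number of slots
 * used, or 0 if the buffer does not split into whole periods or would
 * need more than MAX_SLOTS slots.
 */
static inline size_t build_cyclic_chain(struct dma_slot *slots,
                                        uint32_t buf_addr, uint32_t buf_bytes,
                                        uint32_t period_bytes)
{
    assert(slots != NULL);
    if (period_bytes == 0 || buf_bytes % period_bytes != 0)
        return 0;

    size_t n = buf_bytes / period_bytes;
    if (n == 0 || n > MAX_SLOTS)
        return 0;

    for (size_t i = 0; i < n; i++) {
        slots[i].addr = buf_addr + (uint32_t)i * period_bytes;
        slots[i].len  = period_bytes;
        slots[i].link = (i + 1) % n;  /* last slot wraps to the first */
    }
    return n;
}
```

Once such a chain is set up, following the link fields visits every period and returns to the first, so no slot needs rewriting while the stream runs. A real driver would also have to leave slots free for other channels.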
Replying to this thread after a long time but just wondering, how do you guarantee in your implementation that DMA will not empty the buffer faster than it is filled?
Thanks,
-Joel
On 08/13/2013 11:30 PM, Joel Fernandes wrote:
[Joel] Replying to this thread after a long time but just wondering, how do you guarantee in your implementation that DMA will not empty the buffer faster than it is filled?
I guess this is also what you've referred to in some threads as the overrun condition.
-Joel
On 08/14/2013 06:53 AM, Joel Fernandes wrote:
On 08/13/2013 11:30 PM, Joel Fernandes wrote:
[Joel] Replying to this thread after a long time but just wondering, how do you guarantee in your implementation that DMA will not empty the buffer faster than it is filled?
I guess this is also what you've referred to in some threads as the overrun condition.
Indeed. ALSA monitors the "position" of the ring, and when the DMA passes the application's "cursor", it reports an underrun or overrun (depending on whether it's playback or capture).
There is no guarantee - only verification. The user's application must keep up, or suffer the consequences. My customer has been using the modified driver to capture 16 channels of 32-bit data at 50kHz for quite a while now. Before the modification, it wasn't even possible to reliably capture more than 4 channels.
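The "verification, not guarantee" idea can be illustrated with a simplified C model of a capture ring. The struct and function names are hypothetical and do not mirror ALSA's internal data structures; ALSA's real check compares against a configurable stop threshold rather than the fixed rule used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified capture ring: free-running frame counters plus the ring size. */
struct capture_ring {
    uint64_t hw_frames;     /* total frames the DMA has written so far */
    uint64_t app_frames;    /* total frames the application has read so far */
    uint32_t buffer_frames; /* ring size in frames */
};

/* Frames captured by the DMA but not yet read by the application. */
static inline uint64_t ring_avail(const struct capture_ring *r)
{
    assert(r->hw_frames >= r->app_frames);
    return r->hw_frames - r->app_frames;
}

/* True once the DMA has wrapped around and overwritten unread frames. */
static inline bool ring_overrun(const struct capture_ring *r)
{
    return ring_avail(r) > r->buffer_frames;
}

/* Sustained data rate the application has to keep up with, in bytes/s. */
static inline uint64_t stream_bytes_per_sec(uint32_t channels,
                                            uint32_t bytes_per_sample,
                                            uint32_t rate_hz)
{
    return (uint64_t)channels * bytes_per_sample * rate_hz;
}
```

For the capture case mentioned above, stream_bytes_per_sec(16, 4, 50000) is 3,200,000 bytes per second, which is the rate the application must drain the ring at to avoid ring_overrun() becoming true.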
Mike.
Kind regards,
Mike Looijmans
On Tue, Aug 13, 2013 at 11:30:54PM -0500, Joel Fernandes wrote:
Replying to this thread after a long time but just wondering, how do you guarantee in your implementation that DMA will not empty the buffer faster than it is filled?
Userspace is ultimately responsible for supplying data - if the application can't keep up then the application should get an underrun reported and restart.
On 06/30/2013 07:06 AM, Lars-Peter Clausen wrote:
Added alsa-devel to Cc.
On 06/28/2013 05:27 AM, Fernandes, Joel wrote:
Hi Lars,
Hope you are doing well.
I am implementing Cyclic DMA support in the EDMA driver that is used by Davinci and now newer TI SoCs. I am thinking once I am done I can plug it into the snd_dmaengine framework.
Currently, however, the davinci-pcm code directly programs the EDMA. That is what I am working to replace with a single driver and adapt to the snd dmaengine framework. However, the current code in davinci-pcm uses internal RAM as an intermediate step in the whole DMA process (first data is transferred from DRAM to IRAM, and then from IRAM to the audio device).
Do you have any ideas on how we can adapt to the framework, such that we can still use the IRAM? Are there any existing implementations out there that do something similar?
Hm, I guess using the snd_dmaengine_pcm helper functions here shouldn't be too hard. Using the generic snd_dmaengine_pcm driver will require some extensions to it though. The mmp platform (pxa/mmp-pcm.c) is also using some kind of on-chip memory, so having support for this in the generic driver certainly
I quickly looked at the implementation there. That's neat the way IRAM is used to allocate the DMA buffer.
makes sense. For the chaining you'd probably have to extend the dmaengine framework, since this kind of interleaved mem-to-mem and mem-to-dev cyclic transfer is currently not possible.
I was wondering whether it makes sense to make this kind of intermediate IRAM step purely a DMA controller driver specific implementation. Basically, what I mean is that the use of IRAM would be unknown to the other DMA layers and implemented purely in the DMA controller driver, making the interleaving with IRAM transparent to the DMAEngine framework and the other drivers. Using device tree or some other method, one could indicate that IRAM is present and should be used for a specific DMA channel.
I'm wondering though why do you need to copy the data to RAM first, is it not possible to map the IRAM to userspace?
Yes, certainly it should be possible to map the IRAM directly. I don't know the exact reasons why it was done that way, but I do know that not using the IRAM was causing underruns. I will run some experiments mapping IRAM directly to see if we still see underruns.
Thanks,
-Joel
On 07/02/2013 03:04 AM, Joel Fernandes wrote:
I was wondering whether it makes sense to make this kind of intermediate IRAM step purely a DMA controller driver specific implementation. Basically, what I mean is that the use of IRAM would be unknown to the other DMA layers and implemented purely in the DMA controller driver, making the interleaving with IRAM transparent to the DMAEngine framework and the other drivers. Using device tree or some other method, one could indicate that IRAM is present and should be used for a specific DMA channel.
Putting the ping-pong buffer handling into the DMA driver would allow you to re-implement the current functionality with the current dmaengine API. So this sounds like an option. And maybe there are also other usecases besides audio for this.
- Lars
participants (6): Clemens Ladisch, Joel Fernandes, Lars-Peter Clausen, Mark Brown, Mike Looijmans, Sekhar Nori