Re: [alsa-devel] [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
Hi Morimoto-san,
On Tuesday 22 July 2014 19:17:23 Kuninori Morimoto wrote:
Hi Laurent
The code has been tested by artificially lowering the maximum chunk size to 4096 bytes and running dmatest, which completed successfully. Morimoto-san, is there an easy way to test cyclic transfers with your audio driver ?
Thank you for your offer. I tested this patchset with the audio DMAC (= cyclic transfer), but it doesn't work for me.
First of all, this sound driver, which uses cyclic transfers, worked well with the shdma-base driver. I had sent the audio DMA support platform-side patches before. But, of course, I'm happy to update the sound driver side.
I will re-send my audio DMAC support patches after this email.
My troubles are...
[snip]
cyclic transfer doesn't work
I got attached error.
[snip]
I can reproduce that, but I have this error coming up before.
[ 16.207027] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:103
[ 16.215795] in_atomic(): 1, irqs_disabled(): 128, pid: 1319, name: aplay
[ 16.222636] CPU: 0 PID: 1319 Comm: aplay Not tainted 3.16.0-rc5-02821-g12a72a3 #2501
[ 16.230536] Backtrace:
[ 16.233056] [<c00121e4>] (dump_backtrace) from [<c0012598>] (show_stack+0x18/0x1c)
[ 16.240778] r6:ffffffff r5:c04aa7c0 r4:00000000 r3:00000000
[ 16.246593] [<c0012580>] (show_stack) from [<c032ea84>] (dump_stack+0x8c/0xc0)
[ 16.253967] [<c032e9f8>] (dump_stack) from [<c0049e98>] (__might_sleep+0xcc/0x108)
[ 16.261689] r4:e8dd2000 r3:00000093
[ 16.265357] [<c0049dcc>] (__might_sleep) from [<c0332134>] (mutex_lock+0x20/0x70)
[ 16.272990] r5:00000000 r4:e900fe00
[ 16.276657] [<c0332114>] (mutex_lock) from [<c01fa4dc>] (regmap_lock_mutex+0x10/0x14)
[ 16.284644] r4:e900fe00 r3:00000000
[ 16.288309] [<c01fa4cc>] (regmap_lock_mutex) from [<c01fb9dc>] (regmap_update_bits+0x2c/0x64)
[ 16.297009] [<c01fb9b0>] (regmap_update_bits) from [<c01fba90>] (regmap_fields_write+0x38/0x44)
[ 16.305883] r7:e8d9d990 r6:00000004 r5:00000040 r4:f0368200
[ 16.311701] [<c01fba58>] (regmap_fields_write) from [<bf0ec280>] (rsnd_write+0x30/0x4c [snd_soc_rcar])
[ 16.321195] r5:e93a4c00 r4:e8d1f898
[ 16.324866] [<bf0ec250>] (rsnd_write [snd_soc_rcar]) from [<bf0ec884>] (rsnd_src_set_convert_rate.isra.6+0xf8/0x144 [snd_soc_rcar])
[ 16.336940] [<bf0ec78c>] (rsnd_src_set_convert_rate.isra.6 [snd_soc_rcar]) from [<bf0ec8fc>] (rsnd_src_init_gen2+0x2c/0xc4 [snd_soc_rcar])
[ 16.349624] r6:00000004 r5:e8d9d810 r4:e8d1f898 r3:bf0ec8d0
[ 16.355438] [<bf0ec8d0>] (rsnd_src_init_gen2 [snd_soc_rcar]) from [<bf0ea640>] (rsnd_soc_dai_trigger+0x1cc/0x22c [snd_soc_rcar])
[ 16.367236] r5:e8d9d810 r4:e8d9d824
[ 16.370916] [<bf0ea474>] (rsnd_soc_dai_trigger [snd_soc_rcar]) from [<bf0c51ec>] (soc_pcm_trigger+0xa8/0xf8 [snd_soc_core])
[ 16.382271] r10:00002000 r9:00002000 r8:e9290d00 r7:e8d9d700 r6:00000001 r5:e99fb500
[ 16.390301] r4:e8ef3810
[ 16.392910] [<bf0c5144>] (soc_pcm_trigger [snd_soc_core]) from [<bf094704>] (snd_pcm_do_start+0x34/0x38 [snd_pcm])
[ 16.403467] r8:bf09e050 r7:00000000 r6:00000003 r5:e99fb500 r4:bf09e050 r3:bf0c5144
[ 16.411421] [<bf0946d0>] (snd_pcm_do_start [snd_pcm]) from [<bf0941f8>] (snd_pcm_action_single+0x40/0x80 [snd_pcm])
[ 16.422079] [<bf0941b8>] (snd_pcm_action_single [snd_pcm]) from [<bf09443c>] (snd_pcm_action+0xcc/0xd0 [snd_pcm])
[ 16.432547] r7:00000003 r6:e99fb5c8 r5:bf09e4c0 r4:e99fb500
[ 16.438366] [<bf094370>] (snd_pcm_action [snd_pcm]) from [<bf097098>] (snd_pcm_start+0x1c/0x24 [snd_pcm])
[ 16.448125] r8:00000000 r7:e8dd2000 r6:e93a4c00 r5:bf09e4c0 r4:e99fb500 r3:00002000
[ 16.456083] [<bf09707c>] (snd_pcm_start [snd_pcm]) from [<bf09b094>] (snd_pcm_lib_write1+0x40c/0x4f0 [snd_pcm])
[ 16.466391] [<bf09ac88>] (snd_pcm_lib_write1 [snd_pcm]) from [<bf09b244>] (snd_pcm_lib_write+0x64/0x78 [snd_pcm])
[ 16.476860] r10:be91ea4c r9:e8dd2000 r8:e8d66488 r7:00000000 r6:0002c780 r5:00000800
[ 16.484889] r4:e99fb500
[ 16.487494] [<bf09b1e0>] (snd_pcm_lib_write [snd_pcm]) from [<bf096c38>] (snd_pcm_playback_ioctl1+0x134/0x4c8 [snd_pcm])
[ 16.498583] r6:00000000 r5:be91ea4c r4:e99fb500
[ 16.503330] [<bf096b04>] (snd_pcm_playback_ioctl1 [snd_pcm]) from [<bf096ffc>] (snd_pcm_playback_ioctl+0x30/0x3c [snd_pcm])
[ 16.514685] r8:e8d66488 r7:be91ea4c r6:00000004 r5:e93d3880 r4:e93d3880
[ 16.521575] [<bf096fcc>] (snd_pcm_playback_ioctl [snd_pcm]) from [<c00d7c10>] (do_vfs_ioctl+0x80/0x5c8)
[ 16.531163] [<c00d7b90>] (do_vfs_ioctl) from [<c00d8194>] (SyS_ioctl+0x3c/0x60)
[ 16.538618] r10:00000000 r9:e8dd2000 r8:00000004 r7:be91ea4c r6:400c4150 r5:e93d3880
[ 16.546648] r4:e93d3880
[ 16.549243] [<c00d8158>] (SyS_ioctl) from [<c000f8a0>] (ret_fast_syscall+0x0/0x30)
[ 16.556964] r8:c000fa24 r7:00000036 r6:00000000 r5:0002c498 r4:0002c448 r3:be91ea4c
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
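For illustration, here is roughly what that fast_io change could look like; the fast_io field is real regmap API, but the register widths below are placeholders rather than values taken from gen.c (regmap_bus has a matching fast_io field that would need the same treatment):

    static const struct regmap_config rsnd_regmap_config = {
            .reg_bits = 32,         /* placeholder layout, not from gen.c */
            .val_bits = 32,
            .fast_io  = true,       /* regmap locks with a spinlock, not a mutex */
    };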
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
(Expanding the CC list)
On Wednesday 23 July 2014 12:28:47 Laurent Pinchart wrote:
On Tuesday 22 July 2014 19:17:23 Kuninori Morimoto wrote:
Hi Laurent
The code has been tested by artificially lowering the maximum chunk size to 4096 bytes and running dmatest, which completed successfully. Morimoto-san, is there an easy way to test cyclic transfers with your audio driver ?
Thank you for your offer. I tested this patchset with the audio DMAC (= cyclic transfer), but it doesn't work for me.
First of all, this sound driver, which uses cyclic transfers, worked well with the shdma-base driver. I had sent the audio DMA support platform-side patches before. But, of course, I'm happy to update the sound driver side.
I will re-send my audio DMAC support patches after this email.
My troubles are...
[snip]
cyclic transfer doesn't work
I got attached error.
[snip]
I can reproduce that, but I have this error coming up before.
[snip: same backtrace as above]
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more complex problem.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
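As a sketch of the situation described above, this is roughly what the submit path looks like with runtime PM handled in tx_submit; the rcar_dmac_* names mirror the driver, but the helpers and fields here are illustrative:

    static dma_cookie_t rcar_dmac_tx_submit(struct dma_async_tx_descriptor *tx)
    {
            struct rcar_dmac_chan *chan = to_rcar_dmac_chan(tx->chan);
            unsigned long flags;
            dma_cookie_t cookie;

            /* May sleep: this is exactly what breaks when the caller holds
             * a spinlock, as in the snd_pcm_lib_write1() path above. */
            pm_runtime_get_sync(chan->chan.device->dev);

            spin_lock_irqsave(&chan->lock, flags);
            cookie = dma_cookie_assign(tx);
            list_add_tail(&to_rcar_dmac_desc(tx)->node, &chan->desc_pending);
            spin_unlock_irqrestore(&chan->lock, flags);

            return cookie;
    }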
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
Hi Laurent
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more complex problem.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic() now, but it used prep_slave_single() before, calling it from a work queue in the dai trigger function. How about using the same method with prep_dma_cyclic() ? Do you think your issue would be solved if the sound driver called prep_dma_cyclic() from a work queue ? A sketch follows below.
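A rough sketch of that suggestion, deferring the whole prep/submit sequence to process context; the struct rsnd_dma fields and names below are hypothetical:

    static void rsnd_dma_work(struct work_struct *work)
    {
            struct rsnd_dma *dma = container_of(work, struct rsnd_dma, work);
            struct dma_async_tx_descriptor *desc;

            /* Process context: sleeping allocations and runtime PM are safe. */
            desc = dmaengine_prep_dma_cyclic(dma->chan, dma->buf_addr,
                                             dma->buf_len, dma->period_len,
                                             DMA_MEM_TO_DEV,
                                             DMA_PREP_INTERRUPT);
            if (!desc)
                    return;

            dmaengine_submit(desc);
            dma_async_issue_pending(dma->chan);
    }

    /* ... and from the atomic trigger path: */
    schedule_work(&dma->work);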
Best regards
---
Kuninori Morimoto
Hi Laurent again
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more complex problem.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic() now, but it used prep_slave_single() before, calling it from a work queue in the dai trigger function. How about using the same method with prep_dma_cyclic() ? Do you think your issue would be solved if the sound driver called prep_dma_cyclic() from a work queue ?
Sorry, this doesn't solve the issue. dmaengine_prep_dma_cyclic() is used in ${LINUX}/sound/core/pcm_dmaengine.c, and the situation there is the same as ours.
Hmm.. In my quick check, other DMAEngine drivers are using GFP_ATOMIC in cyclic/prep_slave_sg functions...
Best regards
---
Kuninori Morimoto
On Wed, Jul 23, 2014 at 06:35:17PM -0700, Kuninori Morimoto wrote:
Hi Laurent again
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more complex problem.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic() now, but it used prep_slave_single() before, calling it from a work queue in the dai trigger function. How about using the same method with prep_dma_cyclic() ? Do you think your issue would be solved if the sound driver called prep_dma_cyclic() from a work queue ?
Sorry, this doesn't solve the issue. dmaengine_prep_dma_cyclic() is used in ${LINUX}/sound/core/pcm_dmaengine.c, and the situation there is the same as ours.
Hmm.. In my quick check, other DMAEngine drivers are using GFP_ATOMIC in cyclic/prep_slave_sg functions...
And that's partially right. You need to use GFP_NOWAIT.
On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more complex problem.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
I think this was brought up some time back, and we have clarified that all _prep functions can be invoked in atomic context.
This is the reason why we have been pushing folks to use GFP_NOWAIT for memory allocations during prepare.
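In practice that means a prep callback allocates its descriptor along these lines (a minimal sketch; unlike GFP_ATOMIC, GFP_NOWAIT does not sleep and does not dip into the emergency reserves):

    desc = kzalloc(sizeof(*desc), GFP_NOWAIT);
    if (!desc)
            return NULL;    /* the caller must cope with allocation failure */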
Thanks for pointing out that the documentation doesn't say so; I will send a patch for that.
On issue_pending and tx_submit: yes, these should be allowed to be called from atomic context too.
Lastly, just to clarify: the callback invoked after a descriptor completes can also be used to submit new descriptors, so folks are dropping locks before invoking the callback.
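The usual driver-side pattern for that looks roughly as follows; the structure and list names are illustrative:

    spin_lock_irqsave(&chan->lock, flags);
    desc = list_first_entry(&chan->done_list, struct my_desc, node);
    list_del(&desc->node);
    callback = desc->txd.callback;
    param = desc->txd.callback_param;
    spin_unlock_irqrestore(&chan->lock, flags);

    if (callback)
            callback(param);        /* may prep and submit new descriptors */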
HTH
Hi Vinod,
On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more complex problem.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
I think this was brought up some time back, and we have clarified that all _prep functions can be invoked in atomic context.
This is the reason why we have been pushing folks to use GFP_NOWAIT for memory allocations during prepare.
From the replies I've received it's pretty clear that the prep functions need to be callable from atomic context. I'll respond to this in more depth in a reply to Russell's e-mail.
Thanks for pointing out that the documentation doesn't say so; I will send a patch for that.
I wish that was all that is missing from the documentation ;-) Luckily Maxime Ripard has sent a patch that documents DMA engine from a DMA engine driver's point of view. While not perfect (I'm going to review it), it's a nice starting point to (hopefully) get to a properly documented framework.
On issue_pending and tx_submit: yes, these should be allowed to be called from atomic context too.
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Similarly, the DMA engine API is split into functions with different prefixes (mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included). I might be able to find a bit of time to work on that, but I'll first need to correctly understand where we come from and where we are. Again, information would be welcome and fully appreciated.
Lastly, just to clarify: the callback invoked after a descriptor completes can also be used to submit new descriptors, so folks are dropping locks before invoking the callback.
On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
A prepared but not submitted transaction is not a pending transaction.
The split is necessary so that a callback can be attached to the transaction. This partially comes from the async-tx API, and also gets a lot of use with the slave API.
The prepare function allocates the descriptor and does the initial setup, but does not mark the descriptor as a pending transaction. It returns the descriptor, and the caller is then free to add a callback function and data pointer to the descriptor before finally submitting it. This sequence must occur in a timely manner as some DMA engine implementations hold a lock between the prepare and submit callbacks (Dan explicitly permits this as part of the API.)
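For a slave user the resulting three-step sequence looks like this (a sketch; the variable names are illustrative):

    desc = dmaengine_prep_dma_cyclic(chan, buf, buf_len, period_len,
                                     DMA_MEM_TO_DEV, DMA_PREP_INTERRUPT);
    if (!desc)
            return -ENOMEM;

    /* The window in which the caller still owns the descriptor: */
    desc->callback = my_complete;
    desc->callback_param = my_data;

    cookie = dmaengine_submit(desc);        /* pending, but not yet issued */
    dma_async_issue_pending(chan);          /* tell the driver to start */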
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Why do you think that DMA_PRIVATE has a bearing on the callbacks? It doesn't. DMA_PRIVATE is all about channel allocation as I explained yesterday, and whether the channel is available for async_tx usage.
A channel marked DMA_PRIVATE is not available for async_tx usage at any moment. A channel without DMA_PRIVATE is available for async_tx usage until it is allocated for the slave API - at which point the generic DMA engine code will mark the channel with DMA_PRIVATE, thereby taking it away from async_tx API usage. When the slave API frees the channel, DMA_PRIVATE will be cleared, making the channel available for async_tx usage.
So, basically, DMA_PRIVATE set -> async_tx usage not allowed. DMA_PRIVATE clear -> async_tx usage permitted. It really is that simple.
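For completeness, a slave-only DMA engine driver typically opts out of async_tx entirely by setting the flag itself at registration time (a sketch, with ddev standing for the driver's struct dma_device):

    dma_cap_zero(ddev->cap_mask);
    dma_cap_set(DMA_SLAVE, ddev->cap_mask);
    dma_cap_set(DMA_PRIVATE, ddev->cap_mask);       /* never used for async_tx */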
Similarly, the DMA engine API is split into functions with different prefixes (mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included).
dmaengine_* are generally the interface functions to the DMA engine code, which have been recently introduced to avoid the multiple levels of pointer indirection having to be typed in every driver.
dma_async_* are the DMA engine interface functions for the async_tx API.
dma_* predate the dmaengine_* naming, and are probably too generic, so should probably end up being renamed to dmaengine_*.
txd_* are all about the DMA engine descriptors.
async_tx_* are the higher level async_tx API functions.
On Fri, Aug 01, 2014 at 03:30:20PM +0100, Russell King - ARM Linux wrote:
On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
A prepared but not submitted transaction is not a pending transaction.
The split is necessary so that a callback can be attached to the transaction. This partially comes from the async-tx API, and also gets a lot of use with the slave API.
The prepare function allocates the descriptor and does the initial setup, but does not mark the descriptor as a pending transaction. It returns the descriptor, and the caller is then free to add a callback function and data pointer to the descriptor before finally submitting it. This sequence must occur in a timely manner as some DMA engine implementations hold a lock between the prepare and submit callbacks (Dan explicitly permits this as part of the API.)
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Why do you think that DMA_PRIVATE has a bearing on the callbacks? It doesn't. DMA_PRIVATE is all about channel allocation as I explained yesterday, and whether the channel is available for async_tx usage.
A channel marked DMA_PRIVATE is not available for async_tx usage at any moment. A channel without DMA_PRIVATE is available for async_tx usage until it is allocated for the slave API - at which point the generic DMA engine code will mark the channel with DMA_PRIVATE, thereby taking it away from async_tx API usage. When the slave API frees the channel, DMA_PRIVATE will be cleared, making the channel available for async_tx usage.
So, basically, DMA_PRIVATE set -> async_tx usage not allowed. DMA_PRIVATE clear -> async_tx usage permitted. It really is that simple.
Similarly, the DMA engine API is split into functions with different prefixes (mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included).
dmaengine_* are generally the interface functions to the DMA engine code, which have been recently introduced to avoid the multiple levels of pointer indirection having to be typed in every driver.
dma_async_* are the DMA engine interface functions for the async_tx API.
dma_* predate the dmaengine_* naming, and are probably too generic, so should probably end up being renamed to dmaengine_*.
txd_* are all about the DMA engine descriptors.
async_tx_* are the higher level async_tx API functions.
Ah, looks like I repeated the good answers from you. Should have read all replies first.
Hi Russell,
On Fri, Aug 1, 2014 at 4:30 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
A prepared but not submitted transaction is not a pending transaction.
The split is necessary so that a callback can be attached to the transaction. This partially comes from the async-tx API, and also gets a lot of use with the slave API.
The prepare function allocates the descriptor and does the initial setup, but does not mark the descriptor as a pending transaction. It returns the descriptor, and the caller is then free to add a callback function and data pointer to the descriptor before finally submitting it. This sequence must occur in a timely manner as some DMA engine implementations hold a lock between the prepare and submit callbacks (Dan explicitly permits this as part of the API.)
I think you misunderstood the question: Laurent asked about dmaengine_submit() (step 2) and dma_async_issue_pending() (step 3), while your answer is about dmaengine_prep_slave_*() (step 1) and dmaengine_submit() (step 2).
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Hi Russell,
On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
A prepared but not submitted transaction is not a pending transaction.
The split is necessary so that a callback can be attached to the transaction. This partially comes from the async-tx API, and also gets a lot of use with the slave API.
The prepare function allocates the descriptor and does the initial setup, but does not mark the descriptor as a pending transaction. It returns the descriptor, and the caller is then free to add a callback function and data pointer to the descriptor before finally submitting it.
No disagreement on that. However, as Geert pointed out, my question was related to the split between dmaengine_submit() and dma_async_issue_pending(), not between the prep_* functions and dmaengine_submit().
This sequence must occur in a timely manner as some DMA engine implementations hold a lock between the prepare and submit callbacks (Dan explicitly permits this as part of the API.)
That really triggers a red alarm in the part of my brain that deals with API design, but I suppose it would be too difficult to change that.
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Why do you think that DMA_PRIVATE has a bearing on the callbacks? It doesn't.
Not on callbacks, but on how pending descriptors are pushed to the hardware. The flag is explicitly checked in dma_issue_pending_all().
DMA_PRIVATE is all about channel allocation as I explained yesterday, and whether the channel is available for async_tx usage.
A channel marked DMA_PRIVATE is not available for async_tx usage at any moment. A channel without DMA_PRIVATE is available for async_tx usage until it is allocated for the slave API - at which point the generic DMA engine code will mark the channel with DMA_PRIVATE, thereby taking it away from async_tx API usage. When the slave API frees the channel, DMA_PRIVATE will be cleared, making the channel available for async_tx usage.
So, basically, DMA_PRIVATE set -> async_tx usage not allowed. DMA_PRIVATE clear -> async_tx usage permitted. It really is that simple.
DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one channel is allocated by __dma_request_channel() the whole device is marked with DMA_PRIVATE, making all channels private. What am I missing ?
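For reference, the behaviour described here comes from the core's channel allocation path; this is recalled from the drivers/dma/dmaengine.c of that era, so treat it as a sketch rather than a verbatim copy:

    /* In __dma_request_channel(), on a successful match: */
    dma_cap_set(DMA_PRIVATE, device->cap_mask);
    device->privatecnt++;

    /* And in dma_release_channel(): */
    if (--chan->device->privatecnt == 0)
            dma_cap_clear(DMA_PRIVATE, chan->device->cap_mask);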
Similarly, the DMA engine API is split into functions with different prefixes (mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included).
dmaengine_* are generally the interface functions to the DMA engine code, which have been recently introduced to avoid the multiple levels of pointer indirection having to be typed in every driver.
dma_async_* are the DMA engine interface functions for the async_tx API.
dma_* predate the dmaengine_* naming, and are probably too generic, so should probably end up being renamed to dmaengine_*.
Thank you for the confirmation. I'll see if I can cook up a patch. It will likely be pretty large and broad though, but I guess there's no way around it.
txd_* are all about the DMA engine descriptors.
async_tx_* are the higher level async_tx API functions.
Thank you for the information. How about the dma_async_* functions, should they be renamed to dmaengine_* as well ? Or are some of them part of the async_tx_* API ?
On Mon, Aug 04, 2014 at 07:00:36PM +0200, Laurent Pinchart wrote:
Hi Russell,
On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
This sequence must occur in a timely manner as some DMA engine implementations hold a lock between the prepare and submit callbacks (Dan explicitly permits this as part of the API.)
That really triggers a red alarm in the part of my brain that deals with API design, but I suppose it would be too difficult to change that.
Mine too, but there's not a lot which can be done about it without changing a lot of users.
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Why do you think that DMA_PRIVATE has a bearing on the callbacks? It doesn't.
Not on callbacks, but on how pending descriptors are pushed to the hardware. The flag is explicitly checked in dma_issue_pending_all().
Right. So, let me put a question to you - what do you think is the effect of the check in dma_issue_pending_all()?
I'll give you a hint - disregard the comment at the top of the function, because that's out of date.
DMA_PRIVATE is all about channel allocation as I explained yesterday, and whether the channel is available for async_tx usage.
A channel marked DMA_PRIVATE is not available for async_tx usage at any moment. A channel without DMA_PRIVATE is available for async_tx usage until it is allocated for the slave API - at which point the generic DMA engine code will mark the channel with DMA_PRIVATE, thereby taking it away from async_tx API usage. When the slave API frees the channel, DMA_PRIVATE will be cleared, making the channel available for async_tx usage.
So, basically, DMA_PRIVATE set -> async_tx usage not allowed. DMA_PRIVATE clear -> async_tx usage permitted. It really is that simple.
DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one channel is allocated by __dma_request_channel() the whole device is marked with DMA_PRIVATE, making all channels private. What am I missing ?
I can't answer that - I don't know why the previous authors decided to make it a DMA-device wide property - presumably there are DMA controllers where this matters.
However, one thing to realise is that a dma_device is a virtual concept - it is a set of channels which share a common set of properties. It is not a physical device. It is entirely reasonable for a set of channels on a physical device to be shared between two different dma_device instances and handed out by the driver code as needed.
Similarly, the DMA engine API is split into functions with different prefixes (mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included).
dmaengine_* are generally the interface functions to the DMA engine code, which have been recently introduced to avoid the multiple levels of pointer indirection having to be typed in every driver.
dma_async_* are the DMA engine interface functions for the async_tx API.
dma_* predate the dmaengine_* naming, and are probably too generic, so should probably end up being renamed to dmaengine_*.
Thank you for the confirmation. I'll see if I can cook up a patch. It will likely be pretty large and broad though, but I guess there's no way around it.
txd_* are all about the DMA engine descriptors.
async_tx_* are the higher level async_tx API functions.
Thank you for the information. How about the dma_async_* functions, should they be renamed to dmaengine_* as well ? Or are some of them part of the async_tx_* API ?
Well, these:
dma_async_device_register
dma_async_device_unregister
dma_async_tx_descriptor_init
are more DMA engine core <-> DMA engine driver interface functions than user functions. The remainder of the dma_async_* functions are internal to the async_tx API.
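From a DMA engine driver's probe they show up roughly like this (a sketch; the my_* operations are placeholders and error handling is omitted):

    dma_cap_set(DMA_SLAVE, ddev->cap_mask);
    ddev->device_prep_slave_sg = my_prep_slave_sg;
    ddev->device_issue_pending = my_issue_pending;
    ddev->device_tx_status = my_tx_status;

    ret = dma_async_device_register(ddev);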
Hi Russell,
On Monday 04 August 2014 18:54:58 Russell King - ARM Linux wrote:
On Mon, Aug 04, 2014 at 07:00:36PM +0200, Laurent Pinchart wrote:
On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
This sequence must occur in a timely manner as some DMA engine implementations hold a lock between the prepare and submit callbacks (Dan explicitly permits this as part of the API.)
That really triggers a red alarm in the part of my brain that deals with API design, but I suppose it would be too difficult to change that.
Mine too, but there's not a lot which can be done about it without changing a lot of users.
Well, the good side is that it "only" requires enough motivation and free time then :-)
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Why do you think that DMA_PRIVATE has a bearing on the callbacks? It doesn't.
Not on callbacks, but on how pending descriptors are pushed to the hardware. The flag is explicitly checked in dma_issue_pending_all().
Right. So, let me put a question to you - what do you think is the effect of the check in dma_issue_pending_all()?
I'll give you a hint - disregard the comment at the top of the function, because that's out of date.
I suppose the idea is that dma_issue_pending_all() is only used for memcpy offload, and can thus ignore channels used for slave transfers.
The code seems to be buggy though. A DMA engine that can serve both memcpy and slave transfers could have one channel allocated for memcpy first, then a second channel allocated for slave transfers. This would cause the DMA_PRIVATE flag to be set, which will prevent dma_issue_pending_all() from calling the device_issue_pending operation of the memcpy channel.
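For reference, the check in question; this is a sketch from memory of the dma_issue_pending_all() of that era, not a verbatim copy of drivers/dma/dmaengine.c:

    void dma_issue_pending_all(void)
    {
            struct dma_device *device;
            struct dma_chan *chan;

            rcu_read_lock();
            list_for_each_entry_rcu(device, &dma_device_list, global_node) {
                    /* The whole device is skipped once DMA_PRIVATE is set,
                     * including channels that were allocated for memcpy. */
                    if (dma_has_cap(DMA_PRIVATE, device->cap_mask))
                            continue;
                    list_for_each_entry(chan, &device->channels, device_node)
                            if (chan->client_count)
                                    device->device_issue_pending(chan);
            }
            rcu_read_unlock();
    }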
DMA_PRIVATE is all about channel allocation as I explained yesterday, and whether the channel is available for async_tx usage.
A channel marked DMA_PRIVATE is not available for async_tx usage at any moment. A channel without DMA_PRIVATE is available for async_tx usage until it is allocated for the slave API - at which point the generic DMA engine code will mark the channel with DMA_PRIVATE, thereby taking it away from async_tx API usage. When the slave API frees the channel, DMA_PRIVATE will be cleared, making the channel available for async_tx usage.
So, basically, DMA_PRIVATE set -> async_tx usage not allowed. DMA_PRIVATE clear -> async_tx usage permitted. It really is that simple.
DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one channel is allocated by __dma_request_channel() the whole device is marked with DMA_PRIVATE, making all channels private. What am I missing ?
I can't answer that - I don't know why the previous authors decided to make it a DMA-device wide property - presumably there are DMA controllers where this matters.
If I understand the history correctly, the reason to make DMA_PRIVATE a device flag is to avoid starving slaves by allocating all channels of a device for memcpy. If the DMA_PRIVATE flag is set by the DMA engine driver, that works as expected. If the DMA engine can be used for both, though, there's no guarantee, and the behaviour won't be very predictable.
By the way, shouldn't DMA_PRIVATE be renamed to something more explicit, such as DMA_NO_MEMCPY or DMA_SLAVE_ONLY ?
However, one thing to realise is that a dma_device is a virtual concept - it is a set of channels which share a common set of properties. It is not a physical device. It is entirely reasonable for a set of channels on a physical device to be shared between two different dma_device instances and handed out by the driver code as needed.
When the channels are independent, sure, but they sometimes share hardware resources. For instance the Renesas R-Car Gen2 SoCs have two general-purpose DMA engines usable for both memcpy and slave transfers, each of them having 15 channels. Each DMA engine arbitrates memory accesses from the different channels, using either fixed priorities or round-robin arbitration. In that case it wouldn't make much sense to split the 15 channels across several dma_device instances.
I actually have the opposite problem, in my case channels of physically separate DMA engines can be used interchangeably to serve the system's slaves. Using the DMA engine DT bindings, DT nodes of the slaves currently reference a specific DMA engine, even if they can be served by both. This leads to limited dynamic channel allocation capabilities (especially when taking into account lazy channel allocation as mentioned in another mail in this thread).
Similarly, the DMA engine API is split into functions with different prefixes (mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included).
dmaengine_* are generally the interface functions to the DMA engine code, which have been recently introduced to avoid the multiple levels of pointer indirection having to be typed in every driver.
dma_async_* are the DMA engine interface functions for the async_tx API.
dma_* predate the dmaengine_* naming, and are probably too generic, so should probably end up being renamed to dmaengine_*.
Thank you for the confirmation. I'll see if I can cook up a patch. It will likely be pretty large and broad though, but I guess there's no way around it.
txd_* are all about the DMA engine descriptors.
async_tx_* are the higher level async_tx API functions.
Thank you for the information. How about the dma_async_* functions, should they be renamed to dmaengine_* as well ? Or are some of them part of the async_tx_* API ?
Well, these:
dma_async_device_register
dma_async_device_unregister
dma_async_tx_descriptor_init
are more DMA engine core <-> DMA engine driver interface functions than user functions. The remainder of the dma_async_* functions are internal to the async_tx API.
I was also thinking about dma_async_issue_pending(). Isn't that function part of the DMA engine API rather than the async API ?
Hi Laurent,
On Wed, Aug 6, 2014 at 1:19 AM, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
I actually have the opposite problem, in my case channels of physically separate DMA engines can be used interchangeably to serve the system's slaves. Using the DMA engine DT bindings, DT nodes of the slaves currently reference a specific DMA engine, even if they can be served by both. This leads to limited dynamic channel allocation capabilities (especially when taking into account lazy channel allocation as mentioned in another mail in this thread).
What about adding a property to the first one, referencing the second (or the other way around, don't know what's the easiest to implement)?
dmac0: dma-controller@e6700000 {
        ...
        renesas,alternative = <&dmac1>;
        ...
};

dmac1: dma-controller@e6720000 {
        ...
};
That would avoid having to bind a slave device explicitly to a single dmac, or having to bind all slave devices to all dmacs.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On Wednesday 06 August 2014, Geert Uytterhoeven wrote:
I actually have the opposite problem, in my case channels of physically separate DMA engines can be used interchangeably to serve the system's slaves. Using the DMA engine DT bindings, DT nodes of the slaves currently reference a specific DMA engine, even if they can be served by both. This leads to limited dynamic channel allocation capabilities (especially when taking into account lazy channel allocation as mentioned in another mail in this thread).
What about adding a property to the first one, referencing the second (or the other way around, don't know what's the easiest to implement)?
dmac0: dma-controller@e6700000 {
        ...
        renesas,alternative = <&dmac1>;
        ...
};

dmac1: dma-controller@e6720000 {
        ...
};
That would avoid having to bind a slave device explicitly to a single dmac, or having to bind all slave devices to all dmacs.
We have a perfectly fine way to express this with the existing binding already: you just list channels for both (or more) controllers for each slave, and let the dma subsystem pick one. This was a compromise we reached when we initially introduced the dma slave binding, the downside being that we have to name every reference from a slave to a controller, even though almost all of them are "rx", "tx" or "data".
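Concretely, with the existing generic binding that would look something like this; the node and the 0x11/0x12 channel specifiers are made up for the example:

    slave-device {
            /* one entry per controller that can reach this slave */
            dmas = <&dmac0 0x11>, <&dmac1 0x11>,
                   <&dmac0 0x12>, <&dmac1 0x12>;
            dma-names = "tx", "tx", "rx", "rx";
    };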
I believe what happened though is that the initial implementation in the kernel was to just pick the first channel for a given name and try that but give up if it fails. This works for the majority of users and I had expected someone to implement a smarter strategy as needed.
The easiest way would be to just randomize the order of the channels during lookup and then try them all, but there is a potential problem with this failing sometimes in nondeterministic ways. Another alternative would be for the dma controller to report back some form of "utilization" number to the dma subsystem and have the common code pick the least utilized engine that is connected to that slave.
Arnd
On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
Hi Vinod,
On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
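For reference, a minimal sketch of that change (only the fast_io line is the point; the other fields are illustrative):

    static const struct regmap_config rsnd_regmap_config = {
            .reg_bits = 32,         /* illustrative register layout */
            .val_bits = 32,
            .fast_io = true,        /* regmap uses a spinlock, not a mutex */
    };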
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem more complex to solve.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
I think this was brought up some time back and we have clarified that all _prep functions can be invoked in atomic context.
This is the reason why we have been pushing folks to use GFP_NOWAIT in memory allocations during prepare.
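As an illustration, a sketch of a prep handler following that rule (the foo_* names are made up):

    static struct dma_async_tx_descriptor *
    foo_prep_slave_sg(struct dma_chan *chan, struct scatterlist *sgl,
                      unsigned int sg_len, enum dma_transfer_direction dir,
                      unsigned long flags, void *context)
    {
            struct foo_desc *desc;

            /* Prep handlers may run in atomic context: no GFP_KERNEL here. */
            desc = kzalloc(sizeof(*desc), GFP_NOWAIT);
            if (!desc)
                    return NULL;

            /* ... translate sgl into hardware descriptors ... */

            dma_async_tx_descriptor_init(&desc->txd, chan);
            /* a real driver would also set desc->txd.tx_submit here */
            return &desc->txd;
    }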
From the replies I've received it's pretty clear that the prep functions need to be callable from atomic context. I'll respond to this in more depth in a reply to Russell's e-mail.
Thanks for pointing out that the documentation doesn't say so, will send a patch for that.
I wish that was all that is missing from the documentation ;-) Luckily Maxime Ripard has sent a patch that documents DMA engine from a DMA engine driver's point of view. While not perfect (I'm going to review it), it's a nice starting point to (hopefully) get to a properly documented framework.
Sure, please do point out any other instance.
Russell did improve it a lot, but then the documentation doesn't get updated very quickly. Yes, new users like you can spot issues far more easily than others.
Also now commit log provides a very valuable source of why a particular thing was done.
from atomic context too.
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
The API states that you need to get a channel, then prepare a descriptor and submit it back. Descriptors can be prepared in any order; the order in which they are submitted is the order in which they run on the dmaengine. The submit marks the descriptor as pending. Typically you should have a pending_list which the descriptor is pushed to.
And lastly invoke dma_async_issue_pending() to start the pending ones.
You have the flexibility to prepare descriptors and issue in the order you like. You can also attach the callback required for descriptors here.
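In client terms, the sequence looks like this (a minimal sketch using the public helpers; the foo_* names are made up):

    static int foo_start_tx(struct dma_chan *chan, dma_addr_t buf, size_t len)
    {
            struct dma_async_tx_descriptor *desc;
            dma_cookie_t cookie;

            desc = dmaengine_prep_slave_single(chan, buf, len, DMA_MEM_TO_DEV,
                                               DMA_PREP_INTERRUPT);
            if (!desc)
                    return -ENOMEM;

            desc->callback = foo_dma_complete;      /* callback attached here */
            desc->callback_param = NULL;

            cookie = dmaengine_submit(desc);        /* marks it as pending */
            if (dma_submit_error(cookie))
                    return -EIO;

            dma_async_issue_pending(chan);          /* starts the pending queue */
            return 0;
    }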
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Ah it is not so dark.
If you look closely at dmaengine channel allocation, it only marks whether a channel is to be used privately or for async_tx. Thus slave devices must set DMA_PRIVATE.
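Concretely, in the controller driver's probe, before registering the dma_device (a sketch; dd points to the driver's struct dma_device):

    dma_cap_zero(dd->cap_mask);
    dma_cap_set(DMA_SLAVE, dd->cap_mask);
    /* Channels are only handed out via private allocation,
     * never grabbed by the async_tx pool. */
    dma_cap_set(DMA_PRIVATE, dd->cap_mask);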
Similarly, the DMA engine API is split in functions with different prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various unprefixed niceties such as async_tx_ack or txd_lock). If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included). I might be able to find a bit of time to work on that, but I'll first need to correctly understand where we come from and where we are. Again, information would be welcome and fully appreciated.
History. DMA engine was developed for async_tx. (hence async_tx)
I think most of the dmaengine APIs are dmaengine_*. The dma_async_* ones are specifically for async_tx usage, and the txd_* ones are descriptor related.
HTH
Hi Vinod,
On Friday 01 August 2014 22:37:44 Vinod Koul wrote:
On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
The rsnd_soc_dai_trigger() function takes a spinlock, making the context atomic, which regmap doesn't like as it locks a mutex.
It might be possible to fix this by setting the fast_io field in both the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap will then use a spinlock instead of a mutex. However, even though I believe that change makes sense and should be done, another atomic context issue will come from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the spinlock held.
What context is the rsnd_soc_dai_trigger() function called in by the alsa core ? If it's guaranteed to be a sleepable context, would it make sense to replace the rsnd_priv spinlock with a mutex ?
Answering myself here, that's not an option, as the trigger function is called in atomic context (replacing the spinlock with a mutex produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
Now, if you're tempted to say that I'm wrong by allocating memory with GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem more complex to solve.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
I think this was brought up some time back and we have clarified that all _prep functions can be invoked in atomic context.
This is the reason why we have been pushing folks to use GFP_NOWAIT in memory allocations during prepare.
From the replies I've received it's pretty clear that the prep functions need to be callable from atomic context. I'll respond to this in more depth in a reply to Russell's e-mail.
Thanks for pointing out that the documentation doesn't say so, will send a patch for that.
I wish that was all that is missing from the documentation ;-) Luckily Maxime Ripard has sent a patch that documents DMA engine from a DMA engine driver's point of view. While not perfect (I'm going to review it), it's a nice starting point to (hopefully) get to a properly documented framework.
Sure, please do point out any other instance.
Don't worry, I will :-)
Russell did improve it a lot, but then the documentation doesn't get updated very quickly. Yes, new users like you can spot issues far more easily than others.
Also now commit log provides a very valuable source of why a particular thing was done.
Come on. That's the theory (and we should of course aim for it). We all know the difference between theory and practice : in theory they're the same.
More seriously speaking, I've dived into the git history more times than I should have to retain some mental sanity, and it leaves way too many questions unanswered.
from atomic context too.
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
The API states that you need to get a channel, then prepare a descriptor and submit it back. Descriptors can be prepared in any order; the order in which they are submitted is the order in which they run on the dmaengine. The submit marks the descriptor as pending. Typically you should have a pending_list which the descriptor is pushed to.
And lastly invoke dma_async_issue_pending() to start the pending ones.
You have the flexibility to prepare descriptors and issue in the order you like. You can also attach the callback required for descriptors here.
The question was why is there a dma_async_issue_pending() operation at all ? Why can't dmaengine_submit() trigger the transfer start ? The only explanation is a small comment in dmaengine.h that states
* This allows drivers to push copies to HW in batches,
* reducing MMIO writes where possible.
I don't think that's applicable for DMA slave transfers. Is it still applicable for anything else ?
Quoting git log, the reason is
commit c13c8260da3155f2cefb63b0d1b7dcdcb405c644
Author: Chris Leech christopher.leech@intel.com
Date: Tue May 23 17:18:44 2006 -0700
[I/OAT]: DMA memcpy subsystem
Provides an API for offloading memory copies to DMA devices
Signed-off-by: Chris Leech christopher.leech@intel.com
Signed-off-by: David S. Miller davem@davemloft.net
;-)
The DMA_PRIVATE capability flag seems to play a role here, but it's far from being clear how that mechanism is supposed to work. This should be documented as well, and any light you could shed on this dark corner of the API would help.
Ah it is not so dark.
If you look closely at dmaengine channel allocation, it only marks whether a channel is to be used privately or for async_tx. Thus slave devices must set DMA_PRIVATE.
In order to avoid scattering one topic across multiple mails, I'll pursue this one in a reply to Russell.
Similarly, the DMA engine API is split in functions with different prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various unprefixed niceties such as async_tx_ack or txd_lock). If there's a rationale for that (beyond just historical reasons) it should also be documented, otherwise a cleanup would help all the confused DMA engine users (myself included). I might be able to find a bit of time to work on that, but I'll first need to correctly understand where we come from and where we are. Again, information would be welcome and fully appreciated.
History. DMA engine was developed for async_tx. (hence async_tx)
I think most of the dmaengine APIs are dmaengine_*. The dma_async_* ones are specifically for async_tx usage, and the txd_* ones are descriptor related.
Ditto here.
On 08/04/2014 06:50 PM, Laurent Pinchart wrote: [...]
from atomic context too.
I'll take this opportunity to question why we have a separation between tx_submit and issue_pending. What's the rationale for that, especially given that dma_issue_pending_all() might kick in at any point and issue pending transfers for all devices. A driver could thus see its submitted but not issued transactions being issued before it explicitly calls dma_async_issue_pending().
The API states that you need to get a channel, then prepare a descriptor and submit it back. Descriptors can be prepared in any order; the order in which they are submitted is the order in which they run on the dmaengine. The submit marks the descriptor as pending. Typically you should have a pending_list which the descriptor is pushed to.
And lastly invoke dma_async_issue_pending() to start the pending ones.
You have the flexibility to prepare descriptors and issue in the order you like. You can also attach the callback required for descriptors here.
The question was why is there a dma_async_issue_pending() operation at all ? Why can't dmaengine_submit() trigger the transfer start ? The only explanation is a small comment in dmaengine.h that states
* This allows drivers to push copies to HW in batches,
* reducing MMIO writes where possible.
I don't think that's applicable for DMA slave transfers. Is it still applicable for anything else ?
[...]
If the hardware has scatter gather support it allows the driver to chain the descriptors before submitting them, which reduces the latency between the transfers as well as the IO overhead. The flaw with the current implementation is that there is only one global chain per channel instead of e.g. having the possibility to build up a chain in a driver and then submit and start the chain. Some drivers have virtual channels where each channel basically acts as the chain, and once issue_pending is called the chain is mapped to a real channel which then executes it.
- Lars
On Mon, Aug 04, 2014 at 08:03:45PM +0200, Lars-Peter Clausen wrote:
If the hardware has scatter gather support it allows the driver to chain the descriptors before submitting them, which reduces the latency between the transfers as well as the IO overhead.
While partially true, that's not the full story...
BTW, you're talking about stuff in DMA engine not being clear, but you're using confusing terminology. Descriptors vs transactions. The prepare functions return a transaction. Descriptors are the hardware data structures which describe the transaction. I'll take what you're talking about above as "chain the previous transaction descriptors to the next transaction descriptors".
The flaw with the current implementation is that there is only one global chain per channel instead of e.g. having the possibility to build up a chain in a driver and then submit and start the chain. Some drivers have virtual channels where each channel basically acts as the chain, and once issue_pending is called the chain is mapped to a real channel which then executes it.
Most DMA engines are unable to program anything except the parameters for the next stage of the transfer. In order to switch between "channels", many DMA engine implementations need the help of the CPU to reprogram the physical channel configuration. Chaining two different channels which may ultimately end up on the same physical channel would be a bug in that case.
Where the real flaw exists is the way that a lot of people write their DMA engine drivers - in particular how they deal with the end of a transfer.
Many driver implementations receive an interrupt from the DMA controller, and either queue a tasklet, or they check the existing transfer, mark it as completed in some way, and queue a tasklet.
When the tasklet runs, they then look to see if there's another transfer which they can start, and they then start it.
That is horribly inefficient - it is much better to do all the DMA manipulation in IRQ context. So, when the channel completes the existing transfer, you move the transaction to the queue of completed transfers and queue the tasklet, check whether there's a transaction for the same channel pending, and if so, start it immediately.
This means that your inter-transfer gap is reduced down from the interrupt latency plus tasklet latency, to just the interrupt latency.
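Something along these lines, assuming a driver-private channel structure with active/pending/completed bookkeeping (all foo_* names are made up):

    static irqreturn_t foo_dma_irq(int irq, void *dev_id)
    {
            struct foo_chan *c = dev_id;

            spin_lock(&c->lock);
            if (c->active) {
                    /* Retire the finished transaction; the tasklet only
                     * runs the completion callbacks later. */
                    list_move_tail(&c->active->node, &c->completed);
                    c->active = NULL;
            }
            /* Start the next pending transaction right here in the IRQ
             * handler, so the inter-transfer gap is just the interrupt
             * latency. */
            if (!list_empty(&c->pending)) {
                    c->active = list_first_entry(&c->pending,
                                                 struct foo_desc, node);
                    list_del(&c->active->node);
                    foo_hw_start(c, c->active);
            }
            spin_unlock(&c->lock);

            tasklet_schedule(&c->task);
            return IRQ_HANDLED;
    }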
Controllers such as OMAP (if their hardware scatter chains were used) do have the ability to reprogram the entire channel configuration from an appropriate transaction, and so /could/ start the next transfer entirely automatically - but I never added support for the hardware scatterlists as I have been told that TI measurements indicated that it did not gain any performance to use them. Had this been implemented, it would mean that OMAP would only need to issue an interrupt to notify completion of a transfer (so the driver would only have to work out how many dma transactions had been completed.)
In this case, it is important that we do batch up the entries (since an already in progress descriptor should not be modified), but I suspect in the case of slave DMA, it is rarely the case that there is more than one or two descriptors queued at any moment.
Hi Russell,
On Monday 04 August 2014 19:32:25 Russell King - ARM Linux wrote:
On Mon, Aug 04, 2014 at 08:03:45PM +0200, Lars-Peter Clausen wrote:
If the hardware has scatter gather support it allows the driver to chain the descriptors before submitting them, which reduces the latency between the transfers as well as the IO overhead.
While partially true, that's not the full story...
BTW, you're talking about stuff in DMA engine not being clear, but you're using confusing terminology. Descriptors vs transactions. The prepare functions return a transaction. Descriptors are the hardware data structures which describe the transaction. I'll take what you're talking about above as "chain the previous transaction descriptors to the next transaction descriptors".
Well, the prep_* functions return a struct dma_async_tx_descriptor, documented as an "async transaction descriptor".
There are several types of descriptors, transactions and transfers involved, with different names depending on where you look at.
- At the highest level, we have the DMA engine representation of a transaction in the form of a struct dma_async_tx_descriptor (even this is slightly misleading, as tx is a common abbreviation of transmit or transmission, but not of transaction).
- One level lower, when the high level transaction targets non contiguous memory (from the device point of view) the transaction is split into contiguous chunks. The device might be able to execute a list (or table, depending on the implementation) of chunks on its own without requiring software intervention. If it isn't, the driver will need to submit the next chunk in the completion interrupt of the previous chunk. Even when the device supports executing multiple chunks on its own, it might be limited in the number of chunks it can chain, requiring software intervention to handle one transaction descriptor.
- At the lowest level, the hardware will perform the transfer by repeating transfer cycles, reading a data unit from the source and writing to the destination. When the source or destination supports it, the read and/or write operations can also be grouped in bursts.
If we want to lower the confusion we should decide on names for those different levels and stick to them.
The highest level unit is called a transaction by (at least some parts of) the API, the name sounds good enough to me. "Transaction" could thus refer to that operation, and "transaction descriptor" to the struct dma_async_tx_descriptor instance.
We could then say that a transaction is split into transfers, each of them targeting a contiguous piece of memory of both the source and the destination, and that transfers are split into transfer cycles, each of them transferring one data unit or element. I'm also open to other proposals (including using the name "chunk" for one of the elements).
The flaw with the current implementation is that there is only one global chain per channel instead of e.g. having the possibility to build up a chain in a driver and then submit and start the chain.
Well, that's not completely true, the API supports scatterlists, so you could create a single transaction descriptor that spans several unrelated transfers (as long as they can use the same channel, for instance targeting the same device for slave transactions).
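For example, assuming the two buffers are already DMA-mapped (buf0_dma/buf1_dma and the lengths are placeholders), a single descriptor could cover both:

    struct scatterlist sg[2];

    sg_init_table(sg, 2);
    sg_dma_address(&sg[0]) = buf0_dma;  /* first unrelated transfer */
    sg_dma_len(&sg[0]) = len0;
    sg_dma_address(&sg[1]) = buf1_dma;  /* second one, same channel */
    sg_dma_len(&sg[1]) = len1;

    desc = dmaengine_prep_slave_sg(chan, sg, 2, DMA_MEM_TO_DEV,
                                   DMA_PREP_INTERRUPT);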
Some drivers have virtual channels where each channel basically acts as the chain, and once issue_pending is called the chain is mapped to a real channel which then executes it.
Most DMA engines are unable to program anything except the parameters for the next stage of the transfer. In order to switch between "channels", many DMA engine implementations need the help of the CPU to reprogram the physical channel configuration. Chaining two different channels which may ultimately end up on the same physical channel would be a bug in that case.
I'm mostly familiar with DMA engines designed for slave transfers. The ones I've seen have channels that are programmed and run independently, usually with some logic to arbitrate bus access. When they support executing lists or arrays of transfers the hardware transfer descriptors include the source and destination addresses and the number of elements to be transferred. The identifier of the slave device (basically the DMA request line to which the slave is connected) is constant across all chained transfers.
I'm not sure what you mean by "switching between channels". Could you please explain that ?
Where the real flaw exists is the way that a lot of people write their DMA engine drivers - in particular how they deal with the end of a transfer.
Many driver implementations receive an interrupt from the DMA controller, and either queue a tasklet, or they check the existing transfer, mark it as completed in some way, and queue a tasklet.
When the tasklet runs, they then look to see if there's another transfer which they can start, and they then start it.
That is horribly inefficient - it is much better to do all the DMA manipulation in IRQ context. So, when the channel completes the existing transfer, you move the transaction to the queue of completed transfers and queue the tasklet, check whether there's a transaction for the same channel pending, and if so, start it immediately.
This means that your inter-transfer gap is reduced down from the interrupt latency plus tasklet latency, to just the interrupt latency.
I totally agree. This should be documented to avoid this kind of mistake in the future. Maxime, if you can find time for it, could you add this to the next version of your documentation patch ?
Controllers such as OMAP (if their hardware scatter chains were used) do have the ability to reprogram the entire channel configuration from an appropriate transaction, and so /could/ start the next transfer entirely automatically - but I never added support for the hardware scatterlists as I have been told that TI measurements indicated that it did not gain any performance to use them. Had this been implemented, it would mean that OMAP would only need to issue an interrupt to notify completion of a transfer (so the driver would only have to work out how many dma transactions had been completed.)
In this case, it is important that we do batch up the entries (since an already in progress descriptor should not be modified), but I suspect in the case of slave DMA, it is rarely the case that there is more than one or two descriptors queued at any moment.
I agree, in most cases there's only one or a few transaction descriptors queued for slave DMA. There could then be a larger number of hardware transfer descriptors to represent one transaction descriptor, but those would have been created by the DMA engine driver from a single transaction descriptor, so there would be no problem chaining the transfers.
How about the memcpy (non-slave) DMA ? Do client drivers submit lots of small DMA transactions that should be chained for optimal performance ?
On Mon, Aug 04, 2014 at 06:50:17PM +0200, Laurent Pinchart wrote:
The question was why is there a dma_async_issue_pending() operation at all ? Why can't dmaengine_submit() trigger the transfer start ? The only explanation is a small comment in dmaengine.h that states
* This allows drivers to push copies to HW in batches,
* reducing MMIO writes where possible.
I don't think that's applicable for DMA slave transfers. Is it still applicable for anything else ?
why not?
If your hw supports sg-lists of, say, length 8 and you prepare two descriptors of lengths 3 and 5, then in issue_pending what prevents you from submitting them to the hardware in one shot while still getting the interrupts ?
I know designware and intel-dma do support that. It's a different matter that the drivers don't.
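A sketch of what such a batching issue_pending could look like (the foo_* names and list layout are made up; the point is splicing everything submitted so far into one hardware chain):

    static void foo_issue_pending(struct dma_chan *chan)
    {
            struct foo_chan *c = to_foo_chan(chan);
            unsigned long flags;

            spin_lock_irqsave(&c->lock, flags);
            /* Move every submitted descriptor to the hardware queue in one
             * go; the controller walks the chained descriptors back-to-back
             * and still raises one interrupt per descriptor. */
            list_splice_tail_init(&c->submitted, &c->pending);
            if (!c->active && !list_empty(&c->pending))
                    foo_hw_start_chain(c);
            spin_unlock_irqrestore(&c->lock, flags);
    }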
On 07/23/2014 01:07 PM, Laurent Pinchart wrote: [...]
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
I think the expectation is that these functions can be called in any context. Maybe what's missing is a way to tell the DMA engine driver to get ready and that it is going to be used very soon. This could be done from the sound devices open() callback.
- Lars
On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
For slave DMA drivers, there is the expectation that the prepare functions will be callable from tasklet context, without any locks held by the driver. So, it's expected that the prepare functions will work from tasklet context.
I don't think we've ever specified whether they should be callable from interrupt context, but in practice, we have drivers which do exactly that, so I think the decision has already been made - they will be callable from IRQ context, and so GFP_ATOMIC is required in the driver.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
Right, runtime PM with DMA engine drivers is hard. The best that can be done right now is to pm_runtime_get() in the alloc_chan_resources() method and put it in free_chan_resources() if you don't want to do the workqueue thing.
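That is, something like this in the DMA engine driver (a sketch; the foo_* names are made up, and both callbacks may sleep):

    static int foo_alloc_chan_resources(struct dma_chan *chan)
    {
            struct foo_chan *c = to_foo_chan(chan);
            int ret;

            /* Sleepable context: keep the controller resumed for as long
             * as the channel is allocated. */
            ret = pm_runtime_get_sync(c->dev);
            return ret < 0 ? ret : 0;
    }

    static void foo_free_chan_resources(struct dma_chan *chan)
    {
            struct foo_chan *c = to_foo_chan(chan);

            pm_runtime_put(c->dev);
    }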
There's a problem with the workqueue thing though - by doing so, you make it asynchronous to the starting of the DMA. The DMA engine API allows for delayed starting (it's actually the normal thing for DMA engine), but that may not always be appropriate or desirable.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
As I mention above, the problem with that is getting the workqueue to run soon enough that it doesn't cause a performance degradation or other issues.
There's also expectations from other code - OMAP for example explicitly needs DMA to be started on the hardware before the audio block can be enabled (from what I remember, it tickles an erratum if this is not done.)
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
IRQs-off contexts. :)
Hi Russell,
On Thursday 24 July 2014 13:51:21 Russell King - ARM Linux wrote:
On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being called in atomic context, and on the other side the function ends up calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more fun, the DMA engine API is undocumented and completely silent on whether the prep functions are allowed to sleep. The question is, who's wrong ?
For slave DMA drivers, there is the expectation that the prepare functions will be callable from tasklet context, without any locks held by the driver. So, it's expected that the prepare functions will work from tasklet context.
I don't think we've ever specified whether they should be callable from interrupt context, but in practice, we have drivers which do exactly that, so I think the decision has already been made - they will be callable from IRQ context, and so GFP_ATOMIC is required in the driver.
I agree with you, the decision has made itself to use allocation routines that will not sleep. However,
$ grep -r GFP_NOWAIT drivers/dma | wc -l
22
$ grep -r GFP_ATOMIC drivers/dma | wc -l
24
Looks like a draw to me :-) I'm pretty sure most of the decisions to use GFP_NOWAIT or GFP_ATOMIC are cases of cargo-cult programming instead of resulting from a thoughtful process. We should document this in Maxime's new DMA engine internals documentation.
This being said, I wonder whether allowing the prep functions to be called in atomic context was a sane decision. We've pushed the whole DMA engine API to atomic context, leading to such niceties as memory allocation pressure and the impossibility to implement runtime PM support without resorting to a work queue for the sole purpose of power management, in *every* DMA engine driver. I can't help but thinking something is broken.
I'd like your opinion on that. Whether fixing it would be worth the effort is a different question, so we could certainly conclude that we'll have to live with an imperfect design for the time being.
The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is suspended. The driver calls pm_runtime_get_sync() to resume the device, and needs to do so when a descriptor is submitted. This operation, currently performed in the tx_submit handler, could be moved to the prep_dma_cyclic or issue_pending handler, but the three operations are called in a row from rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the spinlock held. This means I have no place in my DMA engine driver where I can resume the device.
Right, runtime PM with DMA engine drivers is hard. The best that can be done right now is to pm_runtime_get() in the alloc_chan_resources() method and put it in free_chan_resources() if you don't want to do the workqueue thing.
That's an easy enough implementation, but given that channels are usually requested at probe time and released at remove time, that would be roughly equivalent to no PM at all.
There's a problem with the workqueue thing though - by doing so, you make it asynchronous to the starting of the DMA. The DMA engine API allows for delayed starting (it's actually the normal thing for DMA engine), but that may not always be appropriate or desirable.
If I'm not mistaken, the DMA engine API doesn't provide a way to synchronously start DMA transfers. An implementation could do so, but no guarantee is offered by the API to the caller. I agree, however, that it could be an issue.
Doesn't this call for a new pair of open/close-like functions in the API ? They would be called right before a client driver starts using a channel, and right after it stops using it. Those functions would be allowed to sleep.
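Purely as a strawman, the pair could look like this (this is NOT part of the current API, just the shape of the proposal):

    /* Hypothetical additions to struct dma_device: */
    int (*device_open)(struct dma_chan *chan);      /* may sleep */
    void (*device_close)(struct dma_chan *chan);    /* may sleep */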
Beside simplifying power management, those functions could also be used for lazy hardware channel allocation. Renesas SoCs have DMA engines that include general-purpose channels, usable by any slave. There are more slaves than hardware channels, so not all slaves can be used at the same time. At the moment the hardware channel is allocated when requesting the DMA engine channel, at probe time for most slave drivers. This breaks when too many slaves get registered, even if they're not all used at the same time. Some kind of lazy/delayed allocation scheme would be useful.
Another option would be to request the DMA engine channel later than probe time, when the channel will actually be used. However, that would break deferred probing, and would possibly degrade performance.
One could argue that the rcar-dmac driver could use a work queue to handle power management. That's correct, but the additional complexity, which would be required in *all* DMA engine drivers, seems too high to me. If we need to go that way, this is really a task that the DMA engine core should handle.
As I mention above, the problem with that is getting the workqueue to run soon enough that it doesn't cause a performance degradation or other issues.
That's why I liked the ability to sleep in the prep functions ;-)
There's also expectations from other code - OMAP for example explicitly needs DMA to be started on the hardware before the audio block can be enabled (from what I remember, it tickles an erratum if this is not done.)
Nice. That pretty much bans the usage of a workqueue then, we need something else.
Let's start by answering the background question and updating the DMA engine API documentation once and for all : in which context are drivers allowed to call the prep, tx_submit and issue_pending functions ?
IRQs-off contexts. :)
participants (7): Arnd Bergmann, Geert Uytterhoeven, Kuninori Morimoto, Lars-Peter Clausen, Laurent Pinchart, Russell King - ARM Linux, Vinod Koul