Poor performance on mmap reading arm64 on audio device


Michael Nazzareno Trimarchi

21 Nov 2020

10:40 a.m.

Hi all

I'm trying to figure out how to increase the performance of audio reading through the mmap interface. As far as I understand, the allocation comes from the core/memalloc.c ops, which allocate the memory for DMA under driver/dma. My reference platform is an imx8mm, and on arm64 the allocation is:

0xffff800011ff5000-0xffff800012005000 64K PTE RW NX SHD AF UXN MEM/NORMAL-NC

This is the region that is allocated for the DMA interface.

Now, linear access on the multichannel interface performs badly, and it gets worse if I try to read one channel at a time. So it looks like it is better to copy the block with memcpy into a cached area and then operate on the single-channel samples there. If I'm correct, mmap_begin and mmap_commit basically don't do anything at the cache level, so the page mapping and the way it is used are always the same. Can the interface be modified to allow caching the area during the read and restoring it in the commit phase?

Michael
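
The workaround described in this message (one linear memcpy out of the uncached area, then per-channel access on the cached copy) can be sketched roughly as below. This is a minimal illustration assuming interleaved 16-bit samples; the function name, the staging buffer and the parameter layout are hypothetical and are not part of alsa-lib or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy `frames` interleaved frames out of `src` (standing in for the
 * uncached mmap area) into `staging` (ordinary cached memory) with one
 * linear memcpy, then pull a single channel out of the staging copy.
 * All names here are hypothetical; this is not alsa-lib API.
 */
static void extract_channel(const int16_t *src, int16_t *staging,
                            int16_t *out, size_t frames,
                            unsigned int channels, unsigned int channel)
{
    assert(channels > 0 && channel < channels);

    /* One sequential pass over the slow, uncached region. */
    memcpy(staging, src, frames * channels * sizeof(*src));

    /* Strided per-channel access now touches cached memory only. */
    for (size_t i = 0; i < frames; i++)
        out[i] = staging[i * channels + channel];
}
```

The uncached region is touched exactly once and sequentially, so the strided reads that hurt most on a NORMAL-NC mapping happen only on the cached copy.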


Takashi Iwai

23 Nov 2020

2:23 p.m.

The current mmap API for the sound ring buffer is designed to allow concurrent accesses at any time with minimal kernel-user context switching. So the whole buffer is allocated as coherent and mmapped in one shot. It's pretty efficient for architectures like x86, but it does have disadvantages on ARM, indeed.

mmap_begin and mmap_commit are concepts on the alsa-lib side for supporting the plugins better, and they don't represent kernel ABI. So such an extension would be needed first, and the memory allocation mechanism has to be changed as well. Last but not least, the concurrency has to be reconsidered if this approach is taken.

That said, it's possible in theory, but in practice it's no trivial task.

thanks,

Takashi
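
To make the point about mmap_begin and mmap_commit concrete, here is a toy model of the bookkeeping such a begin/commit pair performs on a capture ring buffer. It is a sketch under assumptions: the struct, its fields and the function names are invented for illustration and do not reproduce alsa-lib's actual code. It only shows that the pair computes an offset and a contiguous frame count and then advances a pointer, without touching caches or entering the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Toy capture ring buffer; field names are illustrative only. */
struct ring {
    size_t buffer_size; /* total frames in the ring */
    size_t appl_ptr;    /* frames consumed so far by the application */
    size_t hw_ptr;      /* frames produced so far by the hardware */
};

/* Frames the application may read right now. */
static size_t ring_avail(const struct ring *r)
{
    return r->hw_ptr - r->appl_ptr;
}

/*
 * "begin": report where the next readable chunk starts and clamp the
 * request to what is available and contiguous before the wrap point.
 */
static size_t ring_begin(const struct ring *r, size_t want, size_t *offset)
{
    size_t avail = ring_avail(r);
    size_t cont;

    assert(r->buffer_size > 0);
    *offset = r->appl_ptr % r->buffer_size;
    cont = r->buffer_size - *offset;
    if (want > avail)
        want = avail;
    if (want > cont)
        want = cont;
    return want;
}

/* "commit": record that `frames` frames have been consumed. */
static void ring_commit(struct ring *r, size_t frames)
{
    assert(frames <= ring_avail(r));
    r->appl_ptr += frames;
}
```

Any cache maintenance tied to these two calls would have to be added somewhere, which is why a kernel-side extension would be needed in addition to the library change.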

Michael Nazzareno Trimarchi

2:44 p.m.

On Mon, Nov 23, 2020 at 2:23 PM Takashi Iwai <tiwai@suse.de> wrote:

> The current mmap API for the sound ring buffer is designed to allow concurrent accesses at any time with minimal kernel-user context switching. So the whole buffer is allocated as coherent and mmapped in one shot. It's pretty efficient for architectures like x86, but it does have disadvantages on ARM, indeed.

Each platform and/or architecture can specialize the mmap and declare the DMA-consistent area to be mapped as a non-cached one:

  vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
  return remap_pfn_range(vma, vma->vm_start,
                         vma->vm_end - vma->vm_start,
                         vma->vm_page_prot);

I have done this for testing purposes. It gives an idea:

- reading multiple channels non-sequentially took around 12% of the CPU with the mmap interface
- reading multiple channels after a memcpy took around 6%
- reading from a cached area took around 3%

I'm trying to figure out how and when to invalidate the area.

I have two use cases:

- writing to the channels (no performance issue)
- reading from the channels

Before reading, I would only need to say that the cached area is not in sync with memory. I think that supporting write use cases makes little sense here.

> mmap_begin and mmap_commit are concepts on the alsa-lib side for supporting the plugins better, and they don't represent kernel ABI. So such an extension would be needed first, and the memory allocation mechanism has to be changed as well. Last but not least,

Are you sure about memory allocation, or just memory mapping?

> the concurrency has to be reconsidered if this approach is taken.

Yes, I know that is a big problem anyway. I don't have a good idea of how to solve it.

Michael

-- Michael Nazzareno Trimarchi Amarula Solutions BV COO Co-Founder Cruquiuskade 47 Amsterdam 1018 AM NL T. +31(0)851119172 M. +39(0)3479132170 https://www.amarulasolutions.com

Takashi Iwai

2:54 p.m.

On Mon, 23 Nov 2020 14:44:52 +0100, Michael Nazzareno Trimarchi wrote:

> Before reading, I would only need to say that the cached area is not in sync with memory. I think that supporting write use cases makes little sense here.

It's a necessary use case, unfortunately. The reason we ended up with one device per direction for the PCM many, many years ago was that some applications need to write to the buffers for marking even when reading. So it can't be read-only, and it's supposed to be coherent on both read and write -- as long as we keep the current API usage.

> Are you sure about memory allocation, or just memory mapping?

I thought you'd need the proper memory allocation for the coherent mmap?

> Yes, I know that is a big problem anyway. I don't have a good idea of how to solve it.

If you find a good solution, let us know. It's a kind of historical obstacle, but certainly it's solvable.

Takashi

Michael Nazzareno Trimarchi

3:19 p.m.

On Mon, Nov 23, 2020 at 2:54 PM Takashi Iwai <tiwai@suse.de> wrote:

> It's a necessary use case, unfortunately. The reason we ended up with one device per direction for the PCM many, many years ago was that some applications need to write to the buffers for marking even when reading. So it can't be read-only, and it's supposed to be coherent on both read and write -- as long as we keep the current API usage.

If I understand correctly, the allocation of the DMA buffer depends on the direction: each device allocates one dma_buffer for the tx device and one for the rx device.

@@ -105,10 +105,16 @@ static int imx_pcm_preallocate_dma_buffer(struct snd_pcm_substream *substream,
        size_t size = imx_pcm_hardware.buffer_bytes_max;
        int ret;

-       ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_IRAM,
-                                 dev,
-                                 size,
-                                 &substream->dma_buffer);
+       if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
+               ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV,
+                                         dev,
+                                         size,
+                                         &substream->dma_buffer);
+       else
+               ret = snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_IRAM,
+                                         dev,
+                                         size,
+                                         &substream->dma_buffer);
        if (ret)
                return ret;

Just a snippet from some of my testing. How the physical memory is used by the kernel has nothing to do with how the memory is then mapped by userspace to read from it. If I allocate it as consistent in snd_dma_alloc_pages, then you can let the user remap the area as cached in their own virtual mapping. What I'm trying to say is that behind the scenes everything is consistent, but the user will get a cache line read during the first access and will then read from the cache. Maybe this assumption is totally wrong?

Michael


Takashi Iwai

3:28 p.m.

Ah, I see your point now. I believe that this kind of mapping tweak could be done, but it doesn't satisfy what the current sound API expects from the mmap; e.g. dmix / dsnoop would fail. So, if anything, this should be an extension for some special usages.

My original idea was to move away from the coherent allocation and mapping entirely, and instead sync dynamically like other drivers do (e.g. net devices), aligned with mmap_begin/mmap_commit in alsa-lib.

Takashi
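
The difference between a cached mapping with no synchronization and the "sync dynamically" idea can be illustrated with a small user-space model. This is only a toy: the two arrays stand in for DRAM and for the CPU's cached view, and every name in it is hypothetical. In a real driver the refresh step would presumably be a streaming DMA sync call such as dma_sync_single_for_cpu(), which is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAMES 4

/*
 * Toy model: `dram` is what the DMA engine writes, `cached` is the
 * CPU's possibly stale view through a cacheable mapping.
 */
struct model {
    int16_t dram[FRAMES];
    int16_t cached[FRAMES];
};

/* The device writes straight to memory, bypassing the CPU cache. */
static void dma_write(struct model *m, size_t i, int16_t v)
{
    assert(i < FRAMES);
    m->dram[i] = v;
}

/* Stand-in for an invalidate done at "begin": refresh the CPU view. */
static void sync_for_cpu(struct model *m, size_t off, size_t n)
{
    assert(off <= FRAMES && n <= FRAMES - off);
    memcpy(&m->cached[off], &m->dram[off], n * sizeof(m->dram[0]));
}

/* CPU reads always go through the cached view. */
static int16_t cpu_read(const struct model *m, size_t i)
{
    assert(i < FRAMES);
    return m->cached[i];
}
```

Reading before sync_for_cpu() returns the old value even though the device has already written a new one. That staleness is what concurrent users of the buffer would run into with a cached mapping and no explicit sync point.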

Michael Nazzareno Trimarchi

4:15 p.m.

On Mon, Nov 23, 2020 at 3:28 PM Takashi Iwai <tiwai@suse.de> wrote:

> Ah, I see your point now. I believe that this kind of mapping tweak could be done, but it doesn't satisfy what the current sound API expects from the mmap; e.g. dmix / dsnoop would fail. So, if anything, this should be an extension for some special usages.

Yes, I understand the dmix problem, but I think that the writer is still a single thread that mixes the sources. And why is dsnoop a problem? Sorry, I don't know the logic of how they are implemented.

> My original idea was to move away from the coherent allocation and mapping entirely, and instead sync dynamically like other drivers do (e.g. net devices), aligned with mmap_begin/mmap_commit in alsa-lib.

Agreed. I was looking at how to make the transition simple, and this is the reason I was exploring the first option.

Michael
