Re: Poor performance on mmap reading arm64 on audio device

23 Nov 2020


Hi
On Mon, Nov 23, 2020 at 2:23 PM Takashi Iwai <tiwai@suse.de> wrote:
...
On Sat, 21 Nov 2020 10:40:04 +0100,
Michael Nazzareno Trimarchi wrote:
...
Hi all
I'm trying to figure out how to increase the performance of audio reading
using the mmap interface. Right now, what I understand is that the
allocation comes from the core/memalloc.c ops, which allocate the memory for
DMA under driver/dma.
The reference platform I have is an imx8mm, and the allocation on arm64 is:
0xffff800011ff5000-0xffff800012005000          64K PTE       RW NX SHD AF            UXN MEM/NORMAL-NC
This is the region that is allocated for the DMA interface.
Now, when accessing the multichannel interface linearly the performance is bad,
and it is worse if I try to access one channel at a time on read.
So it looks like it is better to copy the block using memcpy to a
cached area and then operate on single-channel samples. If what I'm saying is
correct, mmap_begin and mmap_commit
basically don't do anything at the cache level, so the page mapping
and the way it is used are always the same. Can the interface be modified to
allow caching the area during read and restoring it in the commit
phase?
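The memcpy workaround described above can be sketched in plain C. This is only an illustration under stated assumptions: interleaved signed 16-bit samples, and hypothetical names (`read_channel_via_bounce`, `ring`, `bounce`) that do not come from alsa-lib or the kernel. `ring` stands in for the non-cached mmapped area and `bounce` for an ordinary cacheable buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy `frames` interleaved S16 frames out of the ring buffer in one
 * linear pass, then pick a single channel from the cached copy.
 * All names here are hypothetical; `ring` models the non-cached DMA
 * area and `bounce` is normal cacheable memory owned by the caller.
 */
static void read_channel_via_bounce(const int16_t *ring, int16_t *bounce,
                                    size_t frames, unsigned int channels,
                                    unsigned int channel, int16_t *out)
{
    assert(channel < channels);

    /* One sequential read of the non-cached area. */
    memcpy(bounce, ring, frames * channels * sizeof(*ring));

    /* The strided per-channel access now touches cacheable memory only. */
    for (size_t i = 0; i < frames; i++)
        out[i] = bounce[i * channels + channel];
}
```

In a real capture loop, `ring` would presumably be derived from the address and offset reported for the mmapped area, and `bounce` would be sized for one period; both are assumptions, not something stated in this thread.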
The current API of the mmap for the sound ring-buffer is designed to
allow concurrent accesses at any time with minimal kernel-user
context switching.  So the whole buffer is allocated as coherent and
mmapped in one shot.  It's pretty efficient for architectures like x86,
but has disadvantages on ARM, indeed.
Each platform and/or architecture can specialize the mmap and declare that the
area that is DMA-consistent is to be mapped
as non-cached.
vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
return remap_pfn_range(vma, vma->vm_start,
                       vma->vm_end - vma->vm_start, vma->vm_page_prot);
I have done it for testing purposes. This gives an idea:
- reading multiple channels non-sequentially took around 12% of the CPU with
the mmap interface
- reading multiple channels after a memcpy took around 6%
- reading from a cached area took around 3%. I'm trying to figure out how
and when to invalidate the area
I have two use cases:
- write on the channels (no performance issue)
- read on channels
Before reading, I should only have to say that the cached area is not in sync
with memory. I think that supporting write use cases
makes little sense here.
...
The mmap_begin and mmap_commit are concepts on the alsa-lib side
for supporting the plugins better, and they don't represent the kernel
ABI.  So, this extension would be needed first, and the memory
allocation mechanism has to be changed as well.  Last but not least,
Are you sure about memory allocation, or just memory mapping?
...
the concurrency has to be reconsidered if this approach is taken.
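To illustrate why begin/commit is only user-space bookkeeping, here is a deliberately simplified model of a capture ring buffer. It is not alsa-lib code; `struct ring_model`, `model_mmap_begin` and `model_mmap_commit` are hypothetical names, and the model assumes frame counters that never wrap. The point it shows is that "begin" only computes an offset and a contiguous frame count, and "commit" only advances a pointer, so neither step is a natural place for cache maintenance today.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical model of a capture ring buffer; not alsa-lib code. */
struct ring_model {
    size_t buffer_frames; /* ring size in frames */
    size_t hw_ptr;        /* frames written by the device so far */
    size_t appl_ptr;      /* frames consumed by the application so far */
};

/*
 * "begin": report where the next readable frames sit and how many can be
 * read contiguously. No data is copied and no cache maintenance happens.
 */
static size_t model_mmap_begin(const struct ring_model *r, size_t want,
                               size_t *offset)
{
    size_t avail, contiguous, frames;

    assert(r->hw_ptr >= r->appl_ptr);
    avail = r->hw_ptr - r->appl_ptr;
    *offset = r->appl_ptr % r->buffer_frames;
    contiguous = r->buffer_frames - *offset;

    /* Clamp the request to what is available and to the end of the ring. */
    frames = want;
    if (frames > avail)
        frames = avail;
    if (frames > contiguous)
        frames = contiguous;
    return frames;
}

/* "commit": hand the frames back by advancing the application pointer. */
static void model_mmap_commit(struct ring_model *r, size_t frames)
{
    r->appl_ptr += frames;
}
```

Because the device keeps advancing `hw_ptr` while the application sits between begin and commit, any scheme that makes the area cacheable for that window would have to decide which frames to invalidate and when, which is the concurrency concern raised here.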
Yes, I know that is a big problem anyway. I don't have a good idea of how to solve it.
Michael
...
That said, it's possible in theory, but in practice it's no trivial task.
thanks,
Takashi
-- 
Michael Nazzareno Trimarchi
Amarula Solutions BV
COO Co-Founder
Cruquiuskade 47 Amsterdam 1018 AM NL
T. +31(0)851119172
M. +39(0)3479132170
[`as] https://www.amarulasolutions.com