[alsa-devel] ALSA processor usage is too high
I am trying to decode 8 MP3 files simultaneously. Each file is 3 minutes 11 seconds long. When I just decode the files and copy the PCM data to memory it takes 2 minutes 36 seconds. When I do the same test after opening 8 ALSA PCM streams and while writing data to them it takes 4 minutes 37 seconds. This is much too slow for what should be a simple copy operation.
I need to have 8 MP3 files decoding and playing on our custom board. I have taken the driver as far as I can on my own. I am seeking expert help to get this working as soon as possible. I am sorry if this was not the appropriate place to post such a request. Please contact me directly if you are interested in working on this project.
The custom board is set up as follows:
Analog Devices BF537 processor running uClinux from svn://blackfin.uclinux.org/uclinux-dist/trunk uclinux-dist
Using alsa-lib-1.0.23 from http://www.alsa-project.org
Two Cirrus Logic CS42448 CODECs connected to BF537 SPORT0 Primary and Secondary data lines in Multichannel mode
CS42448 configured for TDM (32 bits per channel * 8 channels = 256 bits per frame)
ICS661 Audio Clock at (256*48000Hz) connected to BF537 SPORT0 and both CS42448 CODECs for bit clocking

You can find the driver here: http://www.alcorn.com/ftp/swap/sound_cs42448.zip
Thank you, Adam
Adam Rosenberg Software Engineer
Alcorn McBride Inc. 3300 South Hiawassee Building 105 Orlando, FL 32835
(407) 296 - 5800 ext. 5490
Adam Rosenberg wrote:
I am trying to decode 8 MP3 files simultaneously. Each file is 3 minutes 11 seconds long. When I just decode the files and copy the PCM data to memory it takes 2 minutes 36 seconds. When I do the same test after opening 8 ALSA PCM streams and while writing data to them it takes 4 minutes 37 seconds. This is much too slow for what should be a simple copy operation.
Are you using the "hw" device? Otherwise, it's not a simple copy op.
How much CPU does "aplay -D hw -t raw -f dat /dev/zero" use?
Regards, Clemens
On Fri, Nov 5, 2010 at 10:31 AM, Clemens Ladisch clemens@ladisch.de wrote:
Adam Rosenberg wrote:
I am trying to decode 8 MP3 files simultaneously. Each file is 3 minutes 11 seconds long. When I just decode the files and copy the PCM data to memory it takes 2 minutes 36 seconds. When I do the same test after opening 8 ALSA PCM streams and while writing data to them it takes 4 minutes 37 seconds. This is much too slow for what should be a simple copy operation.
Are you using the "hw" device? Otherwise, it's not a simple copy op.
I am using the hw device for each stream.
How much CPU does "aplay -D hw -t raw -f dat /dev/zero" use?
I do not know how to calculate CPU usage for a given process.
All of my calculations have just been done by running the application I wrote (which has to do a number of other things while decoding mp3 data and playing audio). I time the application from start to finish to determine if it is handling all of the tasks in an acceptable amount of time.
For the audio playback I am polling the streams using snd_pcm_avail_update() and then writing the number of frames available using snd_pcm_writei(). I am trying to squeeze this whole project into 2 MB of flash, so I will not be able to include aplay in the final OS image.
I am able to add aplay for testing so I used your command to open the 8 streams, removed the audio processing from my application, and ran the test again. The result was 4 minutes and 35 seconds, which is basically the same as the result from the test within my application and much too slow.
Thanks, Adam
Adam Rosenberg wrote:
On Fri, Nov 5, 2010 at 10:31 AM, Clemens Ladisch clemens@ladisch.de wrote:
How much CPU does "aplay -D hw -t raw -f dat /dev/zero" use?
I do not know how to calculate CPU usage for a given process.
The time utility (if you have it) measures both elapsed and actually used CPU time.
For the audio playback I am polling the streams using snd_pcm_avail_update() and then writing the number of frames available using snd_pcm_writei().
And what does your program do when avail_update returns 0 frames?
I am able to add aplay for testing so I used your command to open the 8 streams, removed the audio processing from my application, and ran the test again. The result was 4 minutes and 35 seconds, which is basically the same as the result from the test within my application and much too slow.
You cannot write data faster than it's playing; the audio ring buffer has a finite size.
You would have a problem if the processing and/or the driver would make everything so slow that you wouldn't be able to write new data to the device fast enough, which would result in a buffer underrun. Does this actually happen?
Regards, Clemens
On Fri, Nov 5, 2010 at 1:58 PM, Clemens Ladisch clemens@ladisch.de wrote:
Adam Rosenberg wrote:
On Fri, Nov 5, 2010 at 10:31 AM, Clemens Ladisch clemens@ladisch.de wrote:
How much CPU does "aplay -D hw -t raw -f dat /dev/zero" use?
I do not know how to calculate CPU usage for a given process.
The time utility (if you have it) measures both elapsed and actually used CPU time.
I am not sure how to interpret this, but I told aplay to play for 3 minutes from /dev/zero and here are the results:

root:/> time aplay -d 180 -D hw -t raw -f dat /dev/zero
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
real 3m 0.10s
user 0m 0.22s
sys 0m 8.48s
For the audio playback I am polling the streams using snd_pcm_avail_update() and then writing the number of frames available using snd_pcm_writei().
And what does your program do when avail_update returns 0 frames?
If it returns 0 then I do not write any frames. I then check the next stream. This continues in an infinite loop.
You cannot write data faster than it's playing; the audio ring buffer has a finite size.
You would have a problem if the processing and/or the driver would make everything so slow that you wouldn't be able to write new data to the device fast enough, which would result in a buffer underrun. Does this actually happen?
I am currently writing the decoded mp3 data to a buffer in RAM so that the program is decoding the mp3 data as fast as it can. I then run the audio process separately and just play silence. I am doing this so that I can tell how much time is being spent decoding mp3 data and processing audio data so that I know how much time remains for other tasks. From the times I have calculated I can tell that a buffer underrun would occur frequently if I was actually writing the decoded mp3 data to the pcm streams.
-Adam
On Fri, Nov 05, 2010 at 02:45:41PM -0400, Adam Rosenberg wrote:
I am not sure how to interpret this, but I told aplay to play for 3 minutes from /dev/zero and here are the results:

root:/> time aplay -d 180 -D hw -t raw -f dat /dev/zero
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
real 3m 0.10s
This means that the application ran for 3m 0.1s...
user 0m 0.22s
...during this time it spent 220ms in the actual application
sys 0m 8.48s
...and 8.48s in kernel mode.
Adam Rosenberg wrote:
I am not sure how to interpret this, but I told aplay to play for 3 minutes from /dev/zero and here are the results:

root:/> time aplay -d 180 -D hw -t raw -f dat /dev/zero
Playing raw data '/dev/zero' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
real 3m 0.10s
user 0m 0.22s
sys 0m 8.48s
9s / 180s = 5%
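The 5% figure is the CPU time (user plus sys) divided by the elapsed time (real). A minimal sketch of that arithmetic follows; the function names are made up for illustration and the inputs are the three numbers printed by time above.

```c
#include <assert.h>

/* Convert the "3m 0.10s" style printed by time into seconds. */
double to_seconds(int minutes, double seconds)
{
    return 60.0 * minutes + seconds;
}

/* Share of the elapsed time that was spent on the CPU, in percent.
 * user_s and sys_s are CPU time in user and kernel mode, real_s is
 * the wall-clock time of the run. */
double cpu_percent(double user_s, double sys_s, double real_s)
{
    assert(real_s > 0.0);
    return 100.0 * (user_s + sys_s) / real_s;
}
```

With the values from the aplay run (user 0.22 s, sys 8.48 s, real 180.10 s) this gives a little under 5%, which matches the rounded 9s / 180s estimate.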
And what does your program do when avail_update returns 0 frames?
If it returns 0 then I do not write any frames. I then check the next stream. This continues in an infinite loop.
When the program is looping and waiting for some free space to become available in any of the eight buffers, it doesn't actually process audio data. (And you should use poll() with all eight handles so that you don't eat CPU while waiting.)
The aplay experiment above tells me that your program spent 95% of its time calling snd_pcm_avail_update. In that time, you could decode instead.
You cannot write data faster than it's playing; the audio ring buffer has a finite size.
You would have a problem if the processing and/or the driver would make everything so slow that you wouldn't be able to write new data to the device fast enough, which would result in a buffer underrun. Does this actually happen?
I am currently writing the decoded mp3 data to a buffer in RAM so that the program is decoding the mp3 data as fast as it can. I then run the audio process separately and just play silence. I am doing this so that I can tell how much time is being spent decoding mp3 data and processing audio data so that I know how much time remains for other tasks.
Playing silence is not any faster than playing anything else, because the sound card _cannot_ run faster than the configured sample rate.
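To make the point concrete: the playback time depends only on the number of frames and the sample rate, never on the sample values. The helper names below are hypothetical; the 48000 Hz rate and the 180 second run are taken from the aplay test earlier in the thread.

```c
#include <assert.h>

/* Seconds the hardware needs to play `frames` frames at `rate_hz`. */
double playback_seconds(unsigned long frames, unsigned int rate_hz)
{
    assert(rate_hz > 0);
    return (double)frames / (double)rate_hz;
}

/* Frames the hardware consumes in `seconds` of playback at `rate_hz`. */
unsigned long frames_for(double seconds, unsigned int rate_hz)
{
    return (unsigned long)(seconds * rate_hz);
}
```

At 48000 Hz a 180 second run consumes 8,640,000 frames per stream, and writing those frames cannot finish sooner than 180 seconds whether they hold silence or music.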
Regards, Clemens
On Fri, Nov 5, 2010 at 3:10 PM, Clemens Ladisch clemens@ladisch.de wrote:
When the program is looping and waiting for some free space to become available in any of the eight buffers, it doesn't actually process audio data. (And you should use poll() with all eight handles so that you don't eat CPU while waiting.)
I can't use poll because the application has to perform many other tasks in a deterministic manner (meaning I can only use threads and other processes to notify the main loop to perform some task). I tried using the async callback method so that I could set a flag when it was time to copy more audio data to the stream but that didn't seem to work well with multiple streams. I found that polling using avail_update was the only reliable method. Could you provide an alternative example that is known to work with multiple streams in the same application?
The aplay experiment above tells me that your program spent 95% of its time calling snd_pcm_avail_update. In that time, you could decode instead.
Sorry for the confusion, the main loop of my application basically does this:

while (1) {
    processNextAlsaStream();
    processMp3Decoder();
    processLCD();
    processInputs();
    processSerial();
}
so the processNextAlsaStream() function just calls avail_update for the next stream in my list of 8 streams and then handles the result before allowing the loop to process the next task.
Playing silence is not any faster than playing anything else, because the sound card _cannot_ run faster than the configured sample rate.
I agree. I only mentioned it was silence so that it was understood I have a static buffer that I am copying audio frames from so there is no other processing needed (no reading from a file, etc).
Thank you for your help, I am happy to be discussing this with you all as it makes me feel as though I am not totally lost. Please let me know if you have an example of a program that can efficiently handle multiple PCM streams.
Thanks! Adam
On Fri, Nov 05, 2010 at 03:27:01PM -0400, Adam Rosenberg wrote:
avail_update was the only reliable method. Could you provide an alternative example that is known to work with multiple streams in the same application?
poll() is designed for this application.
Sorry for the confusion, the main loop of my application basically does this:

while (1) {
    processNextAlsaStream();
    processMp3Decoder();
    processLCD();
    processInputs();
    processSerial();
}
What you appear to be saying here is that your application, which busy waits, is consuming a lot of CPU. This isn't entirely surprising: as with many APIs in Linux, the ALSA APIs are designed to be event driven. If you really need to do this, I'd suggest converting all the functions which can wait for input (at a guess, at least the ALSA, input and serial ones) to wait for events on their fds using poll(), epoll or whatever. If you desperately need to busy wait, do so by using poll() on an epoll fd with a timeout of zero. This will reduce the overhead you incur for busy waiting.
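A minimal sketch of the zero-timeout idea, assuming a plain file descriptor rather than an epoll fd: poll() with a timeout of 0 returns immediately, so it can serve as a cheap readiness check inside an existing busy loop. The function name fd_ready_now is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if `fd` is ready for `events` right now, 0 if it is not,
 * and -1 if poll() itself failed. The timeout of 0 means "do not sleep". */
int fd_ready_now(int fd, short events)
{
    struct pollfd p;

    assert(fd >= 0);
    p.fd = fd;
    p.events = events;
    p.revents = 0;

    int rc = poll(&p, 1, 0);
    if (rc < 0)
        return -1;
    return (rc > 0 && (p.revents & events)) ? 1 : 0;
}
```

The same check extends to several descriptors by passing a larger pollfd array, which replaces one system call per stream with a single call per loop pass.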
Adam Rosenberg wrote:
On Fri, Nov 5, 2010 at 3:10 PM, Clemens Ladisch clemens@ladisch.de wrote:
(And you should use poll() with all eight handles so that you don't eat CPU while waiting.)
I can't use poll because the application has to perform many other tasks in a deterministic manner (meaning I can only use threads and other processes to notify the main loop to perform some task).
As Mark wrote, this is what poll() was designed for.
the main loop of my application basically does this:

while (1) {
    processNextAlsaStream();
    processMp3Decoder();
    processLCD();
    processInputs();
    processSerial();
}
With poll(), it would look somewhat like this:
struct pollfd pollfds[...];
// fill pollfds with all handles
while (1) {
    poll(...);
    for (1..8)
        if (stream ready for writing)
            processAlsaStream(i);
    if (input ready for reading)
        processInputs();
    if (serial ready for whatever)
        processSerial();
    processMp3Decoder();
}
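One pass of such a loop could be sketched as below. This is an illustration under assumptions, not the poster's code: the descriptor layout (eight stream slots followed by one input and one serial slot) and the counters standing in for the real handlers are invented. In an ALSA program the stream descriptors would come from snd_pcm_poll_descriptors(), and the returned events should be translated with snd_pcm_poll_descriptors_revents() before being tested; the sketch uses ordinary descriptors so that it has no ALSA dependency.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define NUM_STREAMS 8                  /* eight PCM streams, as in the thread */
#define IDX_INPUT   (NUM_STREAMS)      /* hypothetical slot for the input fd */
#define IDX_SERIAL  (NUM_STREAMS + 1)  /* hypothetical slot for the serial fd */
#define NUM_FDS     (NUM_STREAMS + 2)

/* Counters standing in for the application's real handlers. */
struct loop_stats {
    int stream_writes[NUM_STREAMS];
    int input_reads;
    int serial_reads;
};

/* One pass of the event loop: sleep in poll() for at most timeout_ms,
 * then service every descriptor that reported readiness.
 * Returns what poll() returned: ready count, 0 on timeout, -1 on error. */
int run_once(struct pollfd *fds, int timeout_ms, struct loop_stats *st)
{
    assert(fds != NULL && st != NULL);

    int rc = poll(fds, NUM_FDS, timeout_ms);
    if (rc <= 0)
        return rc;

    for (int i = 0; i < NUM_STREAMS; i++)
        if (fds[i].revents & POLLOUT)
            st->stream_writes[i]++;    /* processAlsaStream(i) goes here */
    if (fds[IDX_INPUT].revents & POLLIN)
        st->input_reads++;             /* processInputs() goes here */
    if (fds[IDX_SERIAL].revents & POLLIN)
        st->serial_reads++;            /* processSerial() goes here */
    return rc;
}
```

The caller would invoke run_once() in its while loop and run the MP3 decoder after each pass, as in the outline above.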
If you set the PCM device to non-blocking mode, you do not need to call avail_update before writing; just try to write as much as you currently have.
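The "write as much as you currently have" pattern can be sketched on an ordinary descriptor. This is an analogy, not ALSA code: with a PCM handle, non-blocking mode is selected with SND_PCM_NONBLOCK at open time or with snd_pcm_nonblock(), and snd_pcm_writei() returns -EAGAIN when the ring buffer is full. The helper names below are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Offer up to `len` bytes without ever sleeping. Returns the number of
 * bytes accepted (0 when the buffer is already full) or -1 on a real error. */
ssize_t write_what_fits(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n > 0)
            done += (size_t)n;
        else if (n < 0 && errno == EINTR)
            continue;                  /* interrupted by a signal: retry */
        else if (n == 0 || errno == EAGAIN || errno == EWOULDBLOCK)
            break;                     /* buffer full: come back after poll() */
        else
            return -1;
    }
    return (ssize_t)done;
}
```

The caller keeps whatever was not accepted and offers it again the next time poll() reports the descriptor as writable.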
If you want to do something regularly, use the timeout of poll(), or use a timerfd.
You mentioned threads; these are not directly supported with poll() because they do not have a file handle, but if you want to wake up the main loop, you can write to an eventfd or to a pipe created with pipe().
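The eventfd approach might look like the sketch below, again assuming a Linux system that provides eventfd. A worker thread calls the signal function, and the main loop includes the descriptor in its pollfd array with POLLIN and drains it when it becomes readable. The wakeup_* names are hypothetical, and thread creation is left out to keep the example short.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the wake-up descriptor. Returns the descriptor, or -1 on error. */
int wakeup_create(void)
{
    return eventfd(0, EFD_NONBLOCK);
}

/* Called from a worker thread: adds 1 to the counter, which makes the
 * descriptor readable and so wakes a poll() in the main loop. */
int wakeup_signal(int efd)
{
    uint64_t one = 1;

    assert(efd >= 0);
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Called from the main loop after poll() reports POLLIN: returns how many
 * signals arrived since the last call and resets the counter to zero.
 * Returns 0 when nothing is pending or the read fails. */
uint64_t wakeup_drain(int efd)
{
    uint64_t count = 0;
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count)
        return 0;
    return count;
}
```

A pipe created with pipe() works the same way: the worker writes one byte to the write end and the main loop polls and reads the read end.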
Regards, Clemens
participants (3)
- Adam Rosenberg
- Clemens Ladisch
- Mark Brown