[alsa-devel] EBADFD caused by commit dec428c352217010e4b8bd750d302b8062339d32
Hi!
Since alsa-lib commit dec428c352217010e4b8bd750d302b8062339d32, I've occationally been hit by an EBADFD whenever any program tries to play sound. The situation I get is that the first shmget succeds, so dmix->shmid >= 0, therefore first_instance = 0. But buf.shm_nattach = 1, so before the commit shmptr would have been zeroed out, but isn't anymore. And since I still have: dmix->shmptr->magic == SND_PCM_DIRECT_MAGIC, I don't get EINVAL, but EBADFD, somewhere down the line.
From what I understand, the race condition that was fixed would still
be avoided if shmptr was zeroed on (first_instance || buf.shm_nattch == 1). If that is the case, would you please consider applying attached diff?
Regards, Lars Lindqvist
[Added Qing Cai to Cc, who was the author of the patch in question]
On Sun, 10 Apr 2016 23:57:11 +0200, Lars Lindqvist wrote:
Hi!
Since alsa-lib commit dec428c352217010e4b8bd750d302b8062339d32, I've occationally been hit by an EBADFD whenever any program tries to play sound. The situation I get is that the first shmget succeds, so dmix->shmid >= 0, therefore first_instance = 0.
I wonder how does this succeed? It's a leftover shmem? But then why it contains the garbage...?
But buf.shm_nattach = 1, so before the commit shmptr would have been zeroed out, but isn't anymore. And since I still have: dmix->shmptr->magic == SND_PCM_DIRECT_MAGIC, I don't get EINVAL, but EBADFD, somewhere down the line.
Could you give which line actually gives EBADFD?
From what I understand, the race condition that was fixed would still
be avoided if shmptr was zeroed on (first_instance || buf.shm_nattch == 1). If that is the case, would you please consider applying attached diff?
This may work, but I still would like to see how another unexpected situation happens.
thanks,
Takashi
Regards, Lars Lindqvist diff -Naur alsa-lib-1.1.1.orig/src/pcm/pcm_direct.c alsa-lib-1.1.1/src/pcm/pcm_direct.c --- alsa-lib-1.1.1.orig/src/pcm/pcm_direct.c 2016-03-31 15:10:39.000000000 +0200 +++ alsa-lib-1.1.1/src/pcm/pcm_direct.c 2016-04-10 17:44:08.815456305 +0200 @@ -125,7 +125,7 @@ snd_pcm_direct_shm_discard(dmix); return err; }
- if (first_instance) { /* we're the first user, clear the segment */
- if (first_instance || buf.shm_nattch == 1) { /* we're the first user, clear the segment */ memset(dmix->shmptr, 0, sizeof(snd_pcm_direct_share_t)); if (dmix->ipc_gid >= 0) { buf.shm_perm.gid = dmix->ipc_gid;
On 2016-04-11 Takashi Iwai wrote:
[Added Qing Cai to Cc, who was the author of the patch in question]
On Sun, 10 Apr 2016 23:57:11 +0200, Lars Lindqvist wrote:
Hi!
Since alsa-lib commit dec428c352217010e4b8bd750d302b8062339d32, I've occationally been hit by an EBADFD whenever any program tries to play sound. The situation I get is that the first shmget succeds, so dmix->shmid >= 0, therefore first_instance = 0.
I wonder how does this succeed? It's a leftover shmem? But then why it contains the garbage...?
I seem to be able to trigger it by having one client open, starting another, and quickly closing the first one. I'm not sufficiently familiar with the alsa source, but I would guess that shmget succeeds since someone is already attached, and then the shmem gets deconstructed when the first one closes. Leaving it in a bad state for the second client. So a new type of race situation is possible. However, in this scenario, if I just stop playback or kill the processes, on the next startup of an alsa-lib user everything is fine again. So first_instance == 1 and buf.shm_nattch == 1.
This is in contrast to the occational problem I've been having, that I don't know exactly how to trigger, where shmget apparently always succeeds, giving first_instace == 0 and buf.shm_nattch == 1. Which I've only been able to fix by completely resetting the driver by rmmod + modprobe.
But buf.shm_nattach = 1, so before the commit shmptr would have been zeroed out, but isn't anymore. And since I still have: dmix->shmptr->magic == SND_PCM_DIRECT_MAGIC, I don't get EINVAL, but EBADFD, somewhere down the line.
Could you give which line actually gives EBADFD?
In the case where I can trigger it by will, it is from pcm_dmix.c:1074, where snd_pcm_open_slave returns -EBADFD. But I don't know where it comes from in the spontaneous, more persistent, case. I'm sure it is not the same place, since SNDERR("unable to open slave") is run when snd_pcm_open_slave() < 0, but I get no such message in this case. I'll try to pinpoint it the next chance I get.
From what I understand, the race condition that was fixed would still
be avoided if shmptr was zeroed on (first_instance || buf.shm_nattch == 1). If that is the case, would you please consider applying attached diff?
This may work, but I still would like to see how another unexpected situation happens.
thanks,
Takashi
Regards, Lars Lindqvist diff -Naur alsa-lib-1.1.1.orig/src/pcm/pcm_direct.c alsa-lib-1.1.1/src/pcm/pcm_direct.c --- alsa-lib-1.1.1.orig/src/pcm/pcm_direct.c 2016-03-31 15:10:39.000000000 +0200 +++ alsa-lib-1.1.1/src/pcm/pcm_direct.c 2016-04-10 17:44:08.815456305 +0200 @@ -125,7 +125,7 @@ snd_pcm_direct_shm_discard(dmix); return err; }
- if (first_instance) { /* we're the first user, clear the segment */
- if (first_instance || buf.shm_nattch == 1) { /* we're the first user, clear the segment */ memset(dmix->shmptr, 0, sizeof(snd_pcm_direct_share_t)); if (dmix->ipc_gid >= 0) { buf.shm_perm.gid = dmix->ipc_gid;
On Mon, 11 Apr 2016 17:08:01 +0200, Lars Lindqvist wrote:
On 2016-04-11 Takashi Iwai wrote:
[Added Qing Cai to Cc, who was the author of the patch in question]
On Sun, 10 Apr 2016 23:57:11 +0200, Lars Lindqvist wrote:
Hi!
Since alsa-lib commit dec428c352217010e4b8bd750d302b8062339d32, I've occationally been hit by an EBADFD whenever any program tries to play sound. The situation I get is that the first shmget succeds, so dmix->shmid >= 0, therefore first_instance = 0.
I wonder how does this succeed? It's a leftover shmem? But then why it contains the garbage...?
I seem to be able to trigger it by having one client open, starting another, and quickly closing the first one.
One possible case is that the second stream is opened almost at the same time as the first stream, and the second stream reaches to the point checking SND_PCM_DIRECT_MAGIC before the first one finishes the initialization. Does a hackish patch like below make any difference?
Takashi
--- diff --git a/src/pcm/pcm_direct.c b/src/pcm/pcm_direct.c index 14de734d98eb..d0aa4258004f 100644 --- a/src/pcm/pcm_direct.c +++ b/src/pcm/pcm_direct.c @@ -92,6 +92,7 @@ int snd_pcm_direct_shm_create_or_connect(snd_pcm_direct_t *dmix) { struct shmid_ds buf; int tmpid, err, first_instance = 0; + int repeat = 0; retryget: dmix->shmid = shmget(dmix->ipc_key, sizeof(snd_pcm_direct_share_t), @@ -136,6 +137,9 @@ retryget: } else { if (dmix->shmptr->magic != SND_PCM_DIRECT_MAGIC) { snd_pcm_direct_shm_discard(dmix); + /* might be racy, let's retry for a few times */ + if (++repeat < 10) + goto retryget; return -EINVAL; } }
Den 2016-04-12 skrev Takashi Iwai:
On Mon, 11 Apr 2016 17:08:01 +0200, Lars Lindqvist wrote:
On 2016-04-11 Takashi Iwai wrote:
[Added Qing Cai to Cc, who was the author of the patch in question]
On Sun, 10 Apr 2016 23:57:11 +0200, Lars Lindqvist wrote:
Hi!
Since alsa-lib commit dec428c352217010e4b8bd750d302b8062339d32, I've occationally been hit by an EBADFD whenever any program tries to play sound. The situation I get is that the first shmget succeds, so dmix->shmid >= 0, therefore first_instance = 0.
I wonder how does this succeed? It's a leftover shmem? But then why it contains the garbage...?
I seem to be able to trigger it by having one client open, starting another, and quickly closing the first one.
One possible case is that the second stream is opened almost at the same time as the first stream, and the second stream reaches to the point checking SND_PCM_DIRECT_MAGIC before the first one finishes the initialization. Does a hackish patch like below make any difference?
No it doesn't, the magic check succeeds. The situation is that the first one has been open for "a long time" when I open the second one. So I would think that initialization has been completed properly. If the second one opens just as the first closes, it might be that the second assigns (correctly) first_instance = 0, but by the time it reaches the "if (first_instance)" check, the first one has closed, and left garbage.
Lars
diff --git a/src/pcm/pcm_direct.c b/src/pcm/pcm_direct.c index 14de734d98eb..d0aa4258004f 100644 --- a/src/pcm/pcm_direct.c +++ b/src/pcm/pcm_direct.c @@ -92,6 +92,7 @@ int snd_pcm_direct_shm_create_or_connect(snd_pcm_direct_t *dmix) { struct shmid_ds buf; int tmpid, err, first_instance = 0;
- int repeat = 0;
retryget: dmix->shmid = shmget(dmix->ipc_key, sizeof(snd_pcm_direct_share_t), @@ -136,6 +137,9 @@ retryget: } else { if (dmix->shmptr->magic != SND_PCM_DIRECT_MAGIC) { snd_pcm_direct_shm_discard(dmix);
/* might be racy, let's retry for a few times */
if (++repeat < 10)
} }goto retryget; return -EINVAL;
On Tue, 12 Apr 2016 16:52:28 +0200, Lars Lindqvist wrote:
Den 2016-04-12 skrev Takashi Iwai:
On Mon, 11 Apr 2016 17:08:01 +0200, Lars Lindqvist wrote:
On 2016-04-11 Takashi Iwai wrote:
[Added Qing Cai to Cc, who was the author of the patch in question]
On Sun, 10 Apr 2016 23:57:11 +0200, Lars Lindqvist wrote:
Hi!
Since alsa-lib commit dec428c352217010e4b8bd750d302b8062339d32, I've occationally been hit by an EBADFD whenever any program tries to play sound. The situation I get is that the first shmget succeds, so dmix->shmid >= 0, therefore first_instance = 0.
I wonder how does this succeed? It's a leftover shmem? But then why it contains the garbage...?
I seem to be able to trigger it by having one client open, starting another, and quickly closing the first one.
One possible case is that the second stream is opened almost at the same time as the first stream, and the second stream reaches to the point checking SND_PCM_DIRECT_MAGIC before the first one finishes the initialization. Does a hackish patch like below make any difference?
No it doesn't, the magic check succeeds. The situation is that the first one has been open for "a long time" when I open the second one. So I would think that initialization has been completed properly. If the second one opens just as the first closes, it might be that the second assigns (correctly) first_instance = 0, but by the time it reaches the "if (first_instance)" check, the first one has closed, and left garbage.
Actually we have a semaphore before shm access, so the race at two opens shouldn't happen. I noticed it after I sent my previous mail.
But the semaphore is taken also at snd_pcm_dmix_close(). So I wonder where the race actually happens. Both open and close must be protected while another stream is opening or closing.
Could you try to check where you get the exact error...?
Takashi
Den 2016-04-12 skrev Takashi Iwai:
Actually we have a semaphore before shm access, so the race at two opens shouldn't happen. I noticed it after I sent my previous mail.
But the semaphore is taken also at snd_pcm_dmix_close(). So I wonder where the race actually happens. Both open and close must be protected while another stream is opening or closing.
Could you try to check where you get the exact error...?
The execution tree, as far as I can find is the following:
snd_pcm_dmix_open: * Line 1009 snd_pcm_direct_shm_create_or_connect with the code in question, * which returns 0. So we end up in line 1058, dmix->shmtr->use_server is 0, * so go to line 1072.. running: snd_pcm_open_slave, running: snd_pcm_open_named_slave, running: snd_pcm_open_conf * where snd_dlobj_cache_get gives open_func = snd_pcm_hw_open, so snd_pcm_hw_open() snd_open_device("/dev/snd/pcmC0D0p") * where open() returns -1 with errno = EBADFD
Regards, Lars
On Tue, 12 Apr 2016 18:46:17 +0200, Lars Lindqvist wrote:
Den 2016-04-12 skrev Takashi Iwai:
Actually we have a semaphore before shm access, so the race at two opens shouldn't happen. I noticed it after I sent my previous mail.
But the semaphore is taken also at snd_pcm_dmix_close(). So I wonder where the race actually happens. Both open and close must be protected while another stream is opening or closing.
Could you try to check where you get the exact error...?
The execution tree, as far as I can find is the following:
snd_pcm_dmix_open:
- Line 1009 snd_pcm_direct_shm_create_or_connect with the code in question,
- which returns 0. So we end up in line 1058, dmix->shmtr->use_server is 0,
- so go to line 1072.. running:
snd_pcm_open_slave, running: snd_pcm_open_named_slave, running: snd_pcm_open_conf
- where snd_dlobj_cache_get gives open_func = snd_pcm_hw_open, so
snd_pcm_hw_open() snd_open_device("/dev/snd/pcmC0D0p")
- where open() returns -1 with errno = EBADFD
OK, then the question is why other stream could be closed while this is being opened and protected via semaphore. Maybe the semaphore protection isn't perfect?
In anyway, in such a case, we may retry opening the stream as the first element. This is safer than blindly assuming the first element via nattach value (which is racy). An untested patch is below.
thanks,
Takashi
--- diff --git a/src/pcm/pcm_dmix.c b/src/pcm/pcm_dmix.c index b26a5c790e7e..007d35664ce7 100644 --- a/src/pcm/pcm_dmix.c +++ b/src/pcm/pcm_dmix.c @@ -1020,6 +1020,7 @@ int snd_pcm_dmix_open(snd_pcm_t **pcmp, const char *name, dmix->max_periods = opts->max_periods; dmix->sync_ptr = snd_pcm_dmix_sync_ptr;
+ retry: if (first_instance) { /* recursion is already checked in snd_pcm_direct_get_slave_ipc_offset() */ @@ -1076,6 +1077,13 @@ int snd_pcm_dmix_open(snd_pcm_t **pcmp, const char *name, SND_PCM_APPEND, NULL); if (ret < 0) { + /* all other streams have been closed; + * retry as the first instance + */ + if (ret == -EBADFD) { + first_instance = 1; + goto retry; + } SNDERR("unable to open slave"); goto _err; }
Den 2016-04-12 skrev Takashi Iwai:
On Tue, 12 Apr 2016 18:46:17 +0200, Lars Lindqvist wrote:
Den 2016-04-12 skrev Takashi Iwai:
Actually we have a semaphore before shm access, so the race at two opens shouldn't happen. I noticed it after I sent my previous mail.
But the semaphore is taken also at snd_pcm_dmix_close(). So I wonder where the race actually happens. Both open and close must be protected while another stream is opening or closing.
Could you try to check where you get the exact error...?
The execution tree, as far as I can find is the following:
snd_pcm_dmix_open:
- Line 1009 snd_pcm_direct_shm_create_or_connect with the code in question,
- which returns 0. So we end up in line 1058, dmix->shmtr->use_server is 0,
- so go to line 1072.. running:
snd_pcm_open_slave, running: snd_pcm_open_named_slave, running: snd_pcm_open_conf
- where snd_dlobj_cache_get gives open_func = snd_pcm_hw_open, so
snd_pcm_hw_open() snd_open_device("/dev/snd/pcmC0D0p")
- where open() returns -1 with errno = EBADFD
OK, then the question is why other stream could be closed while this is being opened and protected via semaphore. Maybe the semaphore protection isn't perfect?
In anyway, in such a case, we may retry opening the stream as the first element. This is safer than blindly assuming the first element via nattach value (which is racy). An untested patch is below.
Yes, this seems to work. I'm at least not able to trigger the problem myself anymore. I'll use this for a few days, and report back if I happen to get any unexpected EBADFDs.
Thanks! Lars
On Tue, 12 Apr 2016 22:24:08 +0200, Lars Lindqvist wrote:
Den 2016-04-12 skrev Takashi Iwai:
On Tue, 12 Apr 2016 18:46:17 +0200, Lars Lindqvist wrote:
Den 2016-04-12 skrev Takashi Iwai:
Actually we have a semaphore before shm access, so the race at two opens shouldn't happen. I noticed it after I sent my previous mail.
But the semaphore is taken also at snd_pcm_dmix_close(). So I wonder where the race actually happens. Both open and close must be protected while another stream is opening or closing.
Could you try to check where you get the exact error...?
The execution tree, as far as I can find is the following:
snd_pcm_dmix_open:
- Line 1009 snd_pcm_direct_shm_create_or_connect with the code in question,
- which returns 0. So we end up in line 1058, dmix->shmtr->use_server is 0,
- so go to line 1072.. running:
snd_pcm_open_slave, running: snd_pcm_open_named_slave, running: snd_pcm_open_conf
- where snd_dlobj_cache_get gives open_func = snd_pcm_hw_open, so
snd_pcm_hw_open() snd_open_device("/dev/snd/pcmC0D0p")
- where open() returns -1 with errno = EBADFD
OK, then the question is why other stream could be closed while this is being opened and protected via semaphore. Maybe the semaphore protection isn't perfect?
In anyway, in such a case, we may retry opening the stream as the first element. This is safer than blindly assuming the first element via nattach value (which is racy). An untested patch is below.
Yes, this seems to work. I'm at least not able to trigger the problem myself anymore. I'll use this for a few days, and report back if I happen to get any unexpected EBADFDs.
OK, I now applied the fix to git tree. Let's see whether this works stably enough. Let me know if you still see a similar problem with it.
thanks,
Takashi
participants (2)
-
Lars Lindqvist
-
Takashi Iwai