[alsa-devel] [PATCH 0/2] ALSA: pcm: Fix race condition in runtime access
Since - snd_pcm_detach_substream sets runtime to null without stream lock and - snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
paulhsia (2): ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed() ALSA: pcm: Use stream lock in snd_pcm_detach_substream()
sound/core/pcm.c | 8 +++++++- sound/core/pcm_lib.c | 8 ++++++-- 2 files changed, 13 insertions(+), 3 deletions(-)
If the nullity check for `substream->runtime` is outside of the lock region, it is possible to have a null runtime in the critical section if snd_pcm_detach_substream is called right before the lock.
Signed-off-by: paulhsia paulhsia@chromium.org --- sound/core/pcm_lib.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c index d80041ea4e01..2236b5e0c1f2 100644 --- a/sound/core/pcm_lib.c +++ b/sound/core/pcm_lib.c @@ -1782,11 +1782,14 @@ void snd_pcm_period_elapsed(struct snd_pcm_substream *substream) struct snd_pcm_runtime *runtime; unsigned long flags;
- if (PCM_RUNTIME_CHECK(substream)) + if (snd_BUG_ON(!substream)) return; - runtime = substream->runtime;
snd_pcm_stream_lock_irqsave(substream, flags); + if (PCM_RUNTIME_CHECK(substream)) + goto _unlock; + runtime = substream->runtime; + if (!snd_pcm_running(substream) || snd_pcm_update_hw_ptr0(substream, 1) < 0) goto _end; @@ -1797,6 +1800,7 @@ void snd_pcm_period_elapsed(struct snd_pcm_substream *substream) #endif _end: kill_fasync(&runtime->fasync, SIGIO, POLL_IN); + _unlock: snd_pcm_stream_unlock_irqrestore(substream, flags); } EXPORT_SYMBOL(snd_pcm_period_elapsed);
Since snd_pcm_detach_substream modifies the input substream, the function should use stream lock to protect the modification section.
Signed-off-by: paulhsia paulhsia@chromium.org --- sound/core/pcm.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/sound/core/pcm.c b/sound/core/pcm.c index 9a72d641743d..06da9cb8984a 100644 --- a/sound/core/pcm.c +++ b/sound/core/pcm.c @@ -980,8 +980,12 @@ void snd_pcm_detach_substream(struct snd_pcm_substream *substream) { struct snd_pcm_runtime *runtime;
- if (PCM_RUNTIME_CHECK(substream)) + if (snd_BUG_ON(!substream)) return; + + snd_pcm_stream_lock_irq(substream); + if (PCM_RUNTIME_CHECK(substream)) + goto unlock; runtime = substream->runtime; if (runtime->private_free != NULL) runtime->private_free(runtime); @@ -1000,6 +1004,8 @@ void snd_pcm_detach_substream(struct snd_pcm_substream *substream) put_pid(substream->pid); substream->pid = NULL; substream->pstr->substream_opened--; + unlock: + snd_pcm_stream_unlock_irq(substream); }
static ssize_t show_pcm_class(struct device *dev,
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote:
Since
- snd_pcm_detach_substream sets runtime to null without stream lock and
- snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
paulhsia (2): ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed() ALSA: pcm: Use stream lock in snd_pcm_detach_substream()
sound/core/pcm.c | 8 +++++++- sound/core/pcm_lib.c | 8 ++++++-- 2 files changed, 13 insertions(+), 3 deletions(-)
-- 2.24.0.rc1.363.gb1bccd3e3d-goog
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote:
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote:
Since
- snd_pcm_detach_substream sets runtime to null without stream lock and
- snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
1. interrupt_handler -> goes into snd_pcm_period_elapsed with a valid sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since - the usage of `snd_pcm_period_elapsed` does not warn callers about the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and - lots of drivers already have this hidden issue and I can't fix them one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that - Checking if the substream is null and call into snd_pcm_period_elapsed - But `close` can happen anytime, pass without block and snd_pcm_detach_substream will be trigger right after it
Best, Paul
paulhsia (2): ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed() ALSA: pcm: Use stream lock in snd_pcm_detach_substream()
sound/core/pcm.c | 8 +++++++- sound/core/pcm_lib.c | 8 ++++++-- 2 files changed, 13 insertions(+), 3 deletions(-)
-- 2.24.0.rc1.363.gb1bccd3e3d-goog
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote:
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote:
Since
- snd_pcm_detach_substream sets runtime to null without stream lock and
- snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
- interrupt_handler -> goes into snd_pcm_period_elapsed with a valid
sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since
- the usage of `snd_pcm_period_elapsed` does not warn callers about
the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and
- lots of drivers already have this hidden issue and I can't fix them
one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that
- Checking if the substream is null and call into snd_pcm_period_elapsed
- But `close` can happen anytime, pass without block and
snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
thanks,
Takashi
On Wed, 13 Nov 2019 10:47:51 +0100, Takashi Iwai wrote:
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote:
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote:
Since
- snd_pcm_detach_substream sets runtime to null without stream lock and
- snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
- interrupt_handler -> goes into snd_pcm_period_elapsed with a valid
sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since
- the usage of `snd_pcm_period_elapsed` does not warn callers about
the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and
- lots of drivers already have this hidden issue and I can't fix them
one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that
- Checking if the substream is null and call into snd_pcm_period_elapsed
- But `close` can happen anytime, pass without block and
snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
BTW, what we need essentially for intel8x0 is to just call synchronize_irq() before closing, at best in hw_free procedure:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) { + struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
+ synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
The same would be needed also at the beginning of the prepare, as the application may restart the stream without release.
My idea is to add sync_stop PCM ops and call it from PCM core at snd_pcm_prepare() and snd_pcm_hw_free().
thanks,
Takashi
On Wed, Nov 13, 2019 at 7:36 PM Takashi Iwai tiwai@suse.de wrote:
On Wed, 13 Nov 2019 10:47:51 +0100, Takashi Iwai wrote:
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote:
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote:
Since
- snd_pcm_detach_substream sets runtime to null without stream lock and
- snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
- interrupt_handler -> goes into snd_pcm_period_elapsed with a valid
sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since
- the usage of `snd_pcm_period_elapsed` does not warn callers about
the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and
- lots of drivers already have this hidden issue and I can't fix them
one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that
- Checking if the substream is null and call into snd_pcm_period_elapsed
- But `close` can happen anytime, pass without block and
snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
BTW, what we need essentially for intel8x0 is to just call synchronize_irq() before closing, at best in hw_free procedure:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
The same would be needed also at the beginning of the prepare, as the application may restart the stream without release.
My idea is to add sync_stop PCM ops and call it from PCM core at snd_pcm_prepare() and snd_pcm_hw_free().
Will adding synchronize_irq() in snd_pcm_hw_free there fix the race issue? Is it possible to have sequence like the following steps ? - [Thread 1] snd_pcm_hw_free: just pass synchronize_irq() - [Thread 2] another interrupt come -> snd_intel8x0_update() -> goes into the lock region of snd_pcm_period_elapsed() and passes the PCM_RUNTIME_CHECK (right before snd_pcm_running()) - [Thread 1] snd_pcm_hw_free finished() -> snd_pcm_detach_substream() -> runtime=NULL - [Thread 2] Execute snd_pcm_running and crash
I can't trigger the issue after adding the synchronize_irq(), but maybe it's just luck. Correct my if I miss something.
Thanks, Paul
thanks,
Takashi
On Thu, 14 Nov 2019 15:16:04 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 7:36 PM Takashi Iwai tiwai@suse.de wrote:
On Wed, 13 Nov 2019 10:47:51 +0100, Takashi Iwai wrote:
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote:
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote:
Since
- snd_pcm_detach_substream sets runtime to null without stream lock and
- snd_pcm_period_elapsed checks the nullity of the runtime outside of stream lock.
This will trigger null memory access in snd_pcm_running() call in snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
- interrupt_handler -> goes into snd_pcm_period_elapsed with a valid
sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since
- the usage of `snd_pcm_period_elapsed` does not warn callers about
the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and
- lots of drivers already have this hidden issue and I can't fix them
one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that
- Checking if the substream is null and call into snd_pcm_period_elapsed
- But `close` can happen anytime, pass without block and
snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
BTW, what we need essentially for intel8x0 is to just call synchronize_irq() before closing, at best in hw_free procedure:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
The same would be needed also at the beginning of the prepare, as the application may restart the stream without release.
My idea is to add sync_stop PCM ops and call it from PCM core at snd_pcm_prepare() and snd_pcm_hw_free().
Will adding synchronize_irq() in snd_pcm_hw_free there fix the race issue? Is it possible to have sequence like the following steps ?
- [Thread 1] snd_pcm_hw_free: just pass synchronize_irq()
- [Thread 2] another interrupt come -> snd_intel8x0_update() -> goes
into the lock region of snd_pcm_period_elapsed() and passes the PCM_RUNTIME_CHECK (right before snd_pcm_running())
This shouldn't happen because at the point snd_pcm_hw_free() the stream has been already in the SETUP state, i.e. with trigger PCM callback, the hardware has been programmed not to generate the PCM stream IRQ.
Takashi
- [Thread 1] snd_pcm_hw_free finished() -> snd_pcm_detach_substream()
-> runtime=NULL
- [Thread 2] Execute snd_pcm_running and crash
I can't trigger the issue after adding the synchronize_irq(), but maybe it's just luck. Correct my if I miss something.
Thanks, Paul
thanks,
Takashi
On Thu, Nov 14, 2019 at 10:20 PM Takashi Iwai tiwai@suse.de wrote:
On Thu, 14 Nov 2019 15:16:04 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 7:36 PM Takashi Iwai tiwai@suse.de wrote:
On Wed, 13 Nov 2019 10:47:51 +0100, Takashi Iwai wrote:
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote:
On Tue, 12 Nov 2019 18:17:13 +0100, paulhsia wrote: > > Since > - snd_pcm_detach_substream sets runtime to null without stream lock and > - snd_pcm_period_elapsed checks the nullity of the runtime outside of > stream lock. > > This will trigger null memory access in snd_pcm_running() call in > snd_pcm_period_elapsed.
Well, if a stream is detached, it means that the stream must have been already closed; i.e. it's already a clear bug in the driver that snd_pcm_period_elapsed() is called against such a stream.
Or am I missing other possible case?
thanks,
Takashi
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
- interrupt_handler -> goes into snd_pcm_period_elapsed with a valid
sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since
- the usage of `snd_pcm_period_elapsed` does not warn callers about
the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and
- lots of drivers already have this hidden issue and I can't fix them
one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that
- Checking if the substream is null and call into snd_pcm_period_elapsed
- But `close` can happen anytime, pass without block and
snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
BTW, what we need essentially for intel8x0 is to just call synchronize_irq() before closing, at best in hw_free procedure:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
The same would be needed also at the beginning of the prepare, as the application may restart the stream without release.
My idea is to add sync_stop PCM ops and call it from PCM core at snd_pcm_prepare() and snd_pcm_hw_free().
Will adding synchronize_irq() in snd_pcm_hw_free there fix the race issue? Is it possible to have sequence like the following steps ?
- [Thread 1] snd_pcm_hw_free: just pass synchronize_irq()
- [Thread 2] another interrupt come -> snd_intel8x0_update() -> goes
into the lock region of snd_pcm_period_elapsed() and passes the PCM_RUNTIME_CHECK (right before snd_pcm_running())
This shouldn't happen because at the point snd_pcm_hw_free() the stream has been already in the SETUP state, i.e. with trigger PCM callback, the hardware has been programmed not to generate the PCM stream IRQ.
Thanks for pointing that out. snd_pcm_drop() will be called right before accessing `opts->hw_free` and device dma will be stopped by SNDRV_PCM_TRIGGER_STOP. And snd_pcm_prepare() will be called when the device is not running. So synchronize_irq() should be enough for both of them.
I have a patch like this now in intel8x0:
diff --git a/sound/pci/intel8x0.c b/sound/pci/intel8x0.c index 6ff94d8ad86e..728588937673 100644 --- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) { + struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
+ synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0; @@ -993,6 +995,7 @@ static int snd_intel8x0_pcm_prepare(struct snd_pcm_substream *substream) struct snd_pcm_runtime *runtime = substream->runtime; struct ichdev *ichdev = get_ichdev(substream);
+ synchronize_irq(chip->irq); ichdev->physbuf = runtime->dma_addr; ichdev->size = snd_pcm_lib_buffer_bytes(substream); ichdev->fragsize = snd_pcm_lib_period_bytes(substream);
If that looks good to you, I can upload the patch to pw as well. Then we can upstream the intel8x0 patch and the first change I made in this series (the elapse one). Does that sound good to you?
Thanks, Paul
Takashi
- [Thread 1] snd_pcm_hw_free finished() -> snd_pcm_detach_substream()
-> runtime=NULL
- [Thread 2] Execute snd_pcm_running and crash
I can't trigger the issue after adding the synchronize_irq(), but maybe it's just luck. Correct my if I miss something.
Thanks, Paul
thanks,
Takashi
On Thu, 14 Nov 2019 17:37:54 +0100, Chih-Yang Hsia wrote:
On Thu, Nov 14, 2019 at 10:20 PM Takashi Iwai tiwai@suse.de wrote:
On Thu, 14 Nov 2019 15:16:04 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 7:36 PM Takashi Iwai tiwai@suse.de wrote:
On Wed, 13 Nov 2019 10:47:51 +0100, Takashi Iwai wrote:
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote: > > On Tue, 12 Nov 2019 18:17:13 +0100, > paulhsia wrote: > > > > Since > > - snd_pcm_detach_substream sets runtime to null without stream lock and > > - snd_pcm_period_elapsed checks the nullity of the runtime outside of > > stream lock. > > > > This will trigger null memory access in snd_pcm_running() call in > > snd_pcm_period_elapsed. > > Well, if a stream is detached, it means that the stream must have been > already closed; i.e. it's already a clear bug in the driver that > snd_pcm_period_elapsed() is called against such a stream. > > Or am I missing other possible case? > > > thanks, > > Takashi >
In multithreaded environment, it is possible to have to access both `interrupt_handler` (from irq) and `substream close` (from snd_pcm_release) at the same time. Therefore, in driver implementation, if "substream close function" and the "code section where snd_pcm_period_elapsed() in" do not hold the same lock, then the following things can happen:
- interrupt_handler -> goes into snd_pcm_period_elapsed with a valid
sustream pointer 2. snd_pcm_release_substream: call close without blocking 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set substream->runtime to NULL 4. interrupt_handler -> call snd_pcm_runtime() and crash while accessing fields in `substream->runtime`
e.g. In intel8x0.c driver for ac97 device, In driver intel8x0.c, `snd_pcm_period_elapsed` is called after checking `ichdev->substream` in `snd_intel8x0_update`. And if a `snd_pcm_release` call from alsa-lib and pass through close() and run to snd_pcm_detach_substream() in another thread, it's possible to trigger a crash. I can reproduce the issue within a multithread VM easily.
My patches are trying to provide a basic protection for this situation (and internal pcm lock between detach and elapsed), since
- the usage of `snd_pcm_period_elapsed` does not warn callers about
the possible race if the driver does not force the order for `calling snd_pcm_period_elapsed` and `close` by lock and
- lots of drivers already have this hidden issue and I can't fix them
one by one (You can check the "snd_pcm_period_elapsed usage" and the "close implementation" within all the drivers). The most common mistake is that
- Checking if the substream is null and call into snd_pcm_period_elapsed
- But `close` can happen anytime, pass without block and
snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
BTW, what we need essentially for intel8x0 is to just call synchronize_irq() before closing, at best in hw_free procedure:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
The same would be needed also at the beginning of the prepare, as the application may restart the stream without release.
My idea is to add sync_stop PCM ops and call it from PCM core at snd_pcm_prepare() and snd_pcm_hw_free().
Will adding synchronize_irq() in snd_pcm_hw_free there fix the race issue? Is it possible to have sequence like the following steps ?
- [Thread 1] snd_pcm_hw_free: just pass synchronize_irq()
- [Thread 2] another interrupt come -> snd_intel8x0_update() -> goes
into the lock region of snd_pcm_period_elapsed() and passes the PCM_RUNTIME_CHECK (right before snd_pcm_running())
This shouldn't happen because at the point snd_pcm_hw_free() the stream has been already in the SETUP state, i.e. with trigger PCM callback, the hardware has been programmed not to generate the PCM stream IRQ.
Thanks for pointing that out. snd_pcm_drop() will be called right before accessing `opts->hw_free` and device dma will be stopped by SNDRV_PCM_TRIGGER_STOP. And snd_pcm_prepare() will be called when the device is not running. So synchronize_irq() should be enough for both of them.
I have a patch like this now in intel8x0:
diff --git a/sound/pci/intel8x0.c b/sound/pci/intel8x0.c index 6ff94d8ad86e..728588937673 100644 --- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
@@ -993,6 +995,7 @@ static int snd_intel8x0_pcm_prepare(struct snd_pcm_substream *substream) struct snd_pcm_runtime *runtime = substream->runtime; struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); ichdev->physbuf = runtime->dma_addr; ichdev->size = snd_pcm_lib_buffer_bytes(substream); ichdev->fragsize = snd_pcm_lib_period_bytes(substream);
If that looks good to you, I can upload the patch to pw as well. Then we can upstream the intel8x0 patch and the first change I made in this series (the elapse one). Does that sound good to you?
I already have a patch set that adds the irq-sync commonly, as this problem is seen on various drivers as you already pointed.
Below two patches add the support in PCM core side, and the rest need in intel8x0.c is something like:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -3092,6 +3092,7 @@ static int snd_intel8x0_create(struct snd_card *card, return -EBUSY; } chip->irq = pci->irq; + card->sync_irq = chip->irq;
if ((err = snd_device_new(card, SNDRV_DEV_LOWLEVEL, chip, &ops)) < 0) { snd_intel8x0_free(chip);
(The intel8x0 does re-acquire IRQ, so it'll need a bit more lines, but you get the idea.)
My plan is to merge the whole changes after 5.5-rc1, destined for 5.6.
thanks,
Takashi
On Fri, Nov 15, 2019 at 1:00 AM Takashi Iwai tiwai@suse.de wrote:
On Thu, 14 Nov 2019 17:37:54 +0100, Chih-Yang Hsia wrote:
On Thu, Nov 14, 2019 at 10:20 PM Takashi Iwai tiwai@suse.de wrote:
On Thu, 14 Nov 2019 15:16:04 +0100, Chih-Yang Hsia wrote:
On Wed, Nov 13, 2019 at 7:36 PM Takashi Iwai tiwai@suse.de wrote:
On Wed, 13 Nov 2019 10:47:51 +0100, Takashi Iwai wrote:
On Wed, 13 Nov 2019 08:24:41 +0100, Chih-Yang Hsia wrote: > > On Wed, Nov 13, 2019 at 2:16 AM Takashi Iwai tiwai@suse.de wrote: > > > > On Tue, 12 Nov 2019 18:17:13 +0100, > > paulhsia wrote: > > > > > > Since > > > - snd_pcm_detach_substream sets runtime to null without stream lock and > > > - snd_pcm_period_elapsed checks the nullity of the runtime outside of > > > stream lock. > > > > > > This will trigger null memory access in snd_pcm_running() call in > > > snd_pcm_period_elapsed. > > > > Well, if a stream is detached, it means that the stream must have been > > already closed; i.e. it's already a clear bug in the driver that > > snd_pcm_period_elapsed() is called against such a stream. > > > > Or am I missing other possible case? > > > > > > thanks, > > > > Takashi > > > > In multithreaded environment, it is possible to have to access both > `interrupt_handler` (from irq) and `substream close` (from > snd_pcm_release) at the same time. > Therefore, in driver implementation, if "substream close function" and > the "code section where snd_pcm_period_elapsed() in" do not hold the > same lock, then the following things can happen: > > 1. interrupt_handler -> goes into snd_pcm_period_elapsed with a valid > sustream pointer > 2. snd_pcm_release_substream: call close without blocking > 3. snd_pcm_release_substream: call snd_pcm_detache_substream and set > substream->runtime to NULL > 4. interrupt_handler -> call snd_pcm_runtime() and crash while > accessing fields in `substream->runtime` > > e.g. In intel8x0.c driver for ac97 device, > In driver intel8x0.c, `snd_pcm_period_elapsed` is called after > checking `ichdev->substream` in `snd_intel8x0_update`. > And if a `snd_pcm_release` call from alsa-lib and pass through close() > and run to snd_pcm_detach_substream() in another thread, it's possible > to trigger a crash. > I can reproduce the issue within a multithread VM easily. > > My patches are trying to provide a basic protection for this situation > (and internal pcm lock between detach and elapsed), since > - the usage of `snd_pcm_period_elapsed` does not warn callers about > the possible race if the driver does not force the order for `calling > snd_pcm_period_elapsed` and `close` by lock and > - lots of drivers already have this hidden issue and I can't fix them > one by one (You can check the "snd_pcm_period_elapsed usage" and the > "close implementation" within all the drivers). The most common > mistake is that > - Checking if the substream is null and call into snd_pcm_period_elapsed > - But `close` can happen anytime, pass without block and > snd_pcm_detach_substream will be trigger right after it
Thanks, point taken. While this argument is valid and it's good to harden the PCM core side, the concurrent calls are basically a bug, and we'd need another fix in anyway. Also, the patch 2 makes little sense; there can't be multiple close calls racing with each other. So I'll go for taking your fix but only the first patch.
Back to this race: the surfaced issue is, as you pointed out, the race between snd_pcm_period_elapsed() vs close call. However, the fundamental problem is the pending action after the PCM trigger-stop call. Since the PCM trigger doesn't block nor wait until the hardware actually stops the things, the driver may go to the other step even after this "supposed-to-be-stopped" point. In your case, it goes up to close, and crashes. If we had a sync-stop operation, the interrupt handler should have finished before moving to the close stage, hence such a race could be avoided.
It's been a long known problem, and some drivers have the own implementation for stop-sync. I think it's time to investigate and start implementing the fundamental solution.
BTW, what we need essentially for intel8x0 is to just call synchronize_irq() before closing, at best in hw_free procedure:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
The same would be needed also at the beginning of the prepare, as the application may restart the stream without release.
My idea is to add sync_stop PCM ops and call it from PCM core at snd_pcm_prepare() and snd_pcm_hw_free().
Will adding synchronize_irq() in snd_pcm_hw_free there fix the race issue? Is it possible to have sequence like the following steps ?
- [Thread 1] snd_pcm_hw_free: just pass synchronize_irq()
- [Thread 2] another interrupt come -> snd_intel8x0_update() -> goes
into the lock region of snd_pcm_period_elapsed() and passes the PCM_RUNTIME_CHECK (right before snd_pcm_running())
This shouldn't happen because at the point snd_pcm_hw_free() the stream has been already in the SETUP state, i.e. with trigger PCM callback, the hardware has been programmed not to generate the PCM stream IRQ.
Thanks for pointing that out. snd_pcm_drop() will be called right before accessing `opts->hw_free` and device dma will be stopped by SNDRV_PCM_TRIGGER_STOP. And snd_pcm_prepare() will be called when the device is not running. So synchronize_irq() should be enough for both of them.
I have a patch like this now in intel8x0:
diff --git a/sound/pci/intel8x0.c b/sound/pci/intel8x0.c index 6ff94d8ad86e..728588937673 100644 --- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -923,8 +923,10 @@ static int snd_intel8x0_hw_params(struct snd_pcm_substream *substream,
static int snd_intel8x0_hw_free(struct snd_pcm_substream *substream) {
struct intel8x0 *chip = snd_pcm_substream_chip(substream); struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); if (ichdev->pcm_open_flag) { snd_ac97_pcm_close(ichdev->pcm); ichdev->pcm_open_flag = 0;
@@ -993,6 +995,7 @@ static int snd_intel8x0_pcm_prepare(struct snd_pcm_substream *substream) struct snd_pcm_runtime *runtime = substream->runtime; struct ichdev *ichdev = get_ichdev(substream);
synchronize_irq(chip->irq); ichdev->physbuf = runtime->dma_addr; ichdev->size = snd_pcm_lib_buffer_bytes(substream); ichdev->fragsize = snd_pcm_lib_period_bytes(substream);
If that looks good to you, I can upload the patch to pw as well. Then we can upstream the intel8x0 patch and the first change I made in this series (the elapse one). Does that sound good to you?
I already have a patch set that adds the irq-sync commonly, as this problem is seen on various drivers as you already pointed.
Below two patches add the support in PCM core side, and the rest need in intel8x0.c is something like:
--- a/sound/pci/intel8x0.c +++ b/sound/pci/intel8x0.c @@ -3092,6 +3092,7 @@ static int snd_intel8x0_create(struct snd_card *card, return -EBUSY; } chip->irq = pci->irq;
card->sync_irq = chip->irq;
Will this assignment or removement cause possible race if the driver is careless? Maybe providing some helper functions or teaching driver writers when is the right time to change or remove the sync_irq will help.
Best, Paul
if ((err = snd_device_new(card, SNDRV_DEV_LOWLEVEL, chip, &ops)) < 0) { snd_intel8x0_free(chip);
(The intel8x0 does re-acquire IRQ, so it'll need a bit more lines, but you get the idea.)
My plan is to merge the whole changes after 5.5-rc1, destined for 5.6.
thanks,
Takashi
participants (3)
-
Chih-Yang Hsia
-
paulhsia
-
Takashi Iwai