Adding movable PCIe BARs support in snd_hda_intel

Andrey Grodzovsky andrey.grodzovsky at amd.com
Tue Mar 23 18:08:14 CET 2021



On 2021-03-23 12:50 p.m., Takashi Iwai wrote:
> On Tue, 23 Mar 2021 17:11:20 +0100,
> Andrey Grodzovsky wrote:
>>
>>
>>
>> On 2021-03-23 10:54 a.m., Takashi Iwai wrote:
>>> On Tue, 23 Mar 2021 15:22:01 +0100,
>>> Andrey Grodzovsky wrote:
>>>>
>>>>
>>>>
>>>> On 2021-03-23 7:39 a.m., Takashi Iwai wrote:
>>>>> On Tue, 23 Mar 2021 12:23:16 +0100,
>>>>> Andrey Grodzovsky wrote:
>>>>>>
>>>>>> Just an update, i found the issue which was actually to wake up the HW
>>>>>> before doing stop/restart (using pm_runtime_get_sync), also handled
>>>>>> protecting from concurrent snd_pcm_ioctls accessing the registers
>>>>>> while the BAR is unmapped. Can go through BAR move while aplay is
>>>>>> running now.
>>>>>>
>>>>>> Once again, would be happy for any comments on the code -
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Flog%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C0bda8b151e46406f541308d8ee1bc360%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637521150744943554%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=pZEuJv8898ad52gYIskao38J01dLdH1PEwSC5h3tgdM%3D&reserved=0
>>>>>
>>>>> Hrm, this kind of change is a bit scary; we have some mechanism to
>>>>> stop the HD-audio hardware operation (e.g. for loading the DSP for
>>>>> CA0132 codec), but it's pretty hackish and error-prone.
>>>>
>>>> At least I didn't have issue with HW stopping, the problem is with
>>>> unampping hdac_bus->remap_addr - as long as it's not iorampped
>>>> back no code besides the PCI BAR move should run at all as any register
>>>> access from this address will cause fatal page fault. I took care
>>>> of most of the IOCTls i think, the interrupts are disabled in azx_stop_chip
>>>> but I am probably still missing some stuff like some background work
>>>> items or other IOCTls code path.
>>>
>>> Are the all running processes stopped before entering rescan_prepare
>>> callback?  Otherwise, e.g. a process accessing PCM stream via mmap can
>>> run without ioctl and this may trigger the PCM period elapsed
>>> eventually.
>>
>> Is any of them remaps the BAR into user process VMA ? The only two I see
>> in the sound subsystem doing remapping are are had_pcm_mmap and
>> snd_pcm_lib_default_mmap
>> and both of them remap dma buffers which are in system memory and not
>> MMIO space and  hence should not be impacted by the BAR move.
> 
> When a stream is running, it can cause an interrupt that may issue an
> unsolicited event that is deals with CORB/RIRB later in a work
> asynchronously.

On rescan_prepare I am calling snd_hdac_bus_stop_chip which in turn
seems to disable interrupts - is this enough to guarantee all those
won't be happening past that code ? Should I also flush any work queue
items scheduled from within those ISRs before proceeding ?

> 
>> Or maybe I didn't get your concern correctly ? Why there would be a
>> problem to access the stream otherwise during the BAR move ?
>>
>> Similarly, a control interface may issue the verb that is
>>> processed via CORB/RIRB, too.  So what you need to cover is not only
>>> about PCM.
>>
>> Can you point me to the code path for this - is this through another
>> type of IOCTL or interrupt handlers ?
> 
> Yes, it's handled in the ALSA core control ioctls.  Many of the
> control *_get()/*_put() callbacks in HD-audio are involved with the
> amp or other HD-audio verbs (typically calling snd_hda_codec_amp*(),
> etc).  Those are the mixer elements and handled via regmap with
> HD-audio backend that deals with CORB/RIRB eventually.
> 
>>>>> Do we really need the BAR remap while operating the sound streaming?
>>>>> That is, can't we simply refuse the remap if it's in operation?
>>>>> If it can be refused, we may avoid a big locking or such.
>>>>
>>>> Problem here is that this feature is for the sake of hot-plug of a
>>>> device, this supposed to give enough MMIO space to place all the BARs
>>>> of newly plugging in device. Refusing to move BARs because there is some
>>>> app using the sound card on the background might then lead to failure of
>>>> plugging in the new device and this doesn't sound as a reasonable
>>>> behavior to me. Also note that the lock's impact is minimal as it's
>>>> read side lock only in the IOCTLs an so as long as hot plug not
>>>> taking place it has no impact.
>>>
>>> But a hot-plugged device can't have any running files, no?
>>> So at which timing is the BAR movement _must_ happen?
>>
>> No, the devices this code is written for are those already in the
>> system before the new device is plugged in, the new device needs enough
>> free MMIO space for his own BARs and so this features enables existing
>> devices to allow the PCI core to move their BARs within the  physical
>> address space such that enough free space will be available inside
>> the MMIO window allocated for the new device's upstream PCI bridge.
> 
> Hmm, OK, so it's other way round as I thought of.
> 
>>> And, even if we need some synchronization, the current lock
>>> implementation looks fragile to me.  We likely overlook something when
>>> trying to cover each branch code path.
>>>
>>> Basically it needs to wait until all running processes via PCM or
>>> control API are released.
>>
>> This approach brings a couple of hard questions -
>>
>> How we make them stop at once - they might run for long time, you
>> suggest something like invalidate all their mmaps and on next page
>> fault return SIGBUS to terminate them ? Then I need also to know they
>> actually are done - should I wait for their device file instance
>> release? What if they don't access any mmapped are for a long time ?
>> Sending SIGKILL explicitly ? But do I have the task id's of all of them ?
> 
> You shouldn't kill them but wait for them aborting gracefully, IMO.
> Essentially a kind of hot-unplug.  Most of Linux drivers won't kill
> processes at the hot-unplug, AFAIK.

But see, the difference is such that on hot-plug I may just let those
process pointing at zombie device to hang around indefently because
I am not waiting for any output from them. That indeed what I am doing
for amdgou hot-unplug work. Here i need to blocking wait for them since
I need to finish the procedure so the PCI core can finish shuffling the
existing BARs around and by this freeing enough space for the new device
to be accepted. I user can't plug in his thunderbolt GPU and observe
that nothing is happening actually because we are stuck waiting for some
process just sitting idle.

> 
> But if it has to be done so, we have a list of file descriptors that
> are attached to the PCM and the control devices, too...

So you suggest iterate all of them, sigkill them and confirm all of them 
exit instead of using the rw_sem ? This will cover IOCTLs and any
mmapped accesses i guess. Interrupts we discussed above. What above any
possible background kernel work going on in dedicated threads or work 
items ? Any pointers there what should be blocked and waited for ?

Andrey

> 
>> We had those scenarios discussed during my initial work on GPU hot
>> unplug and came to a conclusssion that trying to wait for current
>> processes to die is not a good approach - see the this respone I got
>> at the time for example -
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2020-May%2F265467.html&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C0bda8b151e46406f541308d8ee1bc360%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637521150744953550%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=kOBHOS8tmokHM23Hx2HH12NWgzadKv7lhuKoNhF2nFU%3D&reserved=0
> 
> But should this be seen rather as an analogy of hot-replug?  That
> won't kill processes.
> 
> 
> Takashi
> 
>>
>> Andrey
>>
>>    We have a mechanism to block the new opens
>>> like snd_hda_lock_devices().  We'll need an additional stuff to wait
>>> until all opened devices are released there, and this requires some
>>> help in the core code, I suppose.
>>> (Actually this hackish lock_devices/unlock_devices should have been in
>>>    the core stuff from the beginning...)
>>>
>>>
>>> thanks,
>>>
>>> Takashi
>>>
>>>>
>>>> Andrey
>>>>
>>>>>
>>>>>
>>>>> thanks,
>>>>>
>>>>> Takashi
>>>>>
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2021-03-19 5:22 p.m., Andrey Grodzovsky wrote:
>>>>>>> Hi, I am working on adding AMD related drivers support for PCIe BARs
>>>>>>> move feature developed by Sergei
>>>>>>> (https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flinuxplumbersconf.org%2Fevent%2F7%2Fcontributions%2F847%2Fattachments%2F584%2F1035%2Flpc2020_sergmir.pdf&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C0bda8b151e46406f541308d8ee1bc360%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637521150744953550%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=QbiCIxry93FNvFJLHNaqEAdPbFRe%2FctIf%2BlGAQg0XDI%3D&reserved=0).
>>>>>>>
>>>>>>> His feature is still not upstream, all his code and mine on top can
>>>>>>> be seen here -
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Flog%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C0bda8b151e46406f541308d8ee1bc360%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637521150744953550%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=G514jwRAVet2bFXPC7a1s0%2B0cRYhX7f45gesxQpcYfk%3D&reserved=0
>>>>>>>
>>>>>>> I did basic implementation fro amdgpu driver and now I am doing the
>>>>>>> same for snd_hda_intel to support our on GPU Azalia audio
>>>>>>> chips. This relevant commit is here -
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3D7ec0f60633e898cb941cebb3cd18aae1374fc365&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C0bda8b151e46406f541308d8ee1bc360%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637521150744953550%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=zKywjLs5mCdQwlt%2Fjb0qe1aOlVb%2FWWf6eDqQ4m9WGSs%3D&reserved=0
>>>>>>>
>>>>>>> Any driver that wants to support movable BARs needs to implement
>>>>>>> rescan_prepare, rescan_done and bar_fixed callbacks where
>>>>>>> rescan_prepare needs to stop HW/SW and unamp all MMIO mappings and
>>>>>>> rescan_done needs to ioremap the BAR from it's new MMIO location and
>>>>>>> restart HW/SW.
>>>>>>>
>>>>>>> I am able currently to trigger BARs move by sysfs using
>>>>>>> "/sys/bus/pci/rescan" and the driver will go through the sequence I
>>>>>>> described above without hangs. Problem is that after this when i try
>>>>>>> to use mplayer I am getting following errors:
>>>>>>>
>>>>>>> andrey at andrey-test:~$ sudo mplayer -ao alsa:device=hw=0.3
>>>>>>> Downloads/file_example_MP3_5MG.mp3
>>>>>>> MPlayer 1.3.0 (Debian), built with gcc-9 (C) 2000-2016 MPlayer Team
>>>>>>> do_connect: could not connect to socket
>>>>>>> connect: No such file or directory
>>>>>>> Failed to open LIRC support. You will not be able to use your remote
>>>>>>> control.
>>>>>>>
>>>>>>> Playing Downloads/file_example_MP3_5MG.mp3.
>>>>>>> libavformat version 58.29.100 (external)
>>>>>>> Audio only file format detected.
>>>>>>> Load subtitles in Downloads/
>>>>>>> ==========================================================================
>>>>>>> Opening audio decoder: [mpg123] MPEG 1.0/2.0/2.5 layers I, II, III
>>>>>>> AUDIO: 44100 Hz, 2 ch, s16le, 320.0 kbit/22.68% (ratio: 40000->176400)
>>>>>>> Selected audio codec: [mpg123] afm: mpg123 (MPEG 1.0/2.0/2.5 layers
>>>>>>> I, II, III)
>>>>>>> ==========================================================================
>>>>>>> AO: [alsa] 44100Hz 2ch s16le (2 bytes per sample)
>>>>>>> Video: no video
>>>>>>> Starting playback...
>>>>>>> A:   0.1 (00.0) of 132.0 (02:12.0) ??,?%
>>>>>>> Audio device got stuck!
>>>>>>>
>>>>>>> and in dmesg I see
>>>>>>> [  365.355518] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync
>>>>>>> register 0x2f0d00. -5
>>>>>>>
>>>>>>> Also I see 296.619014] snd_hda_intel 0000:0a:00.1: CORB reset
>>>>>>> timeout#2, CORBRP = 65535 error during the rescan_done callback
>>>>>>> execution
>>>>>>>
>>>>>>> I know it has to do with the move of BAR's MMIO address because when i
>>>>>>> disallow BAR migration by returning true from bar_fixed callback I
>>>>>>> have no such errors and mplayer works fine.
>>>>>>>
>>>>>>> I enabled MMIO trace and didn't see that post BAR move there is a
>>>>>>> wrong MMIO access - all of them are from the new MMIO base offset -
>>>>>>> 0xfcd80000 (trace attached including mmio trace and dmesg)
>>>>>>>
>>>>>>> I would be happy for any idea on this and any comment on the
>>>>>>> correctness of my sequence in the code
>>>>>>>
>>>>>>> Andrey
>>>>>>
>>>>
>>


More information about the Alsa-devel mailing list