[PATCH 04/17] ASoC: Intel: avs: Inter process communication

Mon Feb 28 16:19:21 CET 2022

On 2022-02-25 9:37 PM, Pierre-Louis Bossart wrote:
> 
>>>> +static inline void avs_ipc_err(struct avs_dev *adev, struct
>>>> avs_ipc_msg *tx,
>>>> +                   const char *name, int error)
>>>> +{
>>>> +    /*
>>>> +     * If IPC channel is blocked e.g.: due to ongoing recovery,
>>>> +     * -EPERM error code is expected and thus it's not an actual error.
>>>> +     */
>>>> +    if (error == -EPERM)
>>>> +        dev_dbg(adev->dev, "%s 0x%08x 0x%08x failed: %d\n", name,
>>>> +            tx->glb.primary, tx->glb.ext.val, error);
>>>> +    else
>>>> +        dev_err(adev->dev, "%s 0x%08x 0x%08x failed: %d\n", name,
>>>> +            tx->glb.primary, tx->glb.ext.val, error);
>>>> +}
>>>
>>> we've used such functions before and the feedback, e.g. from GregKH and
>>> Mark Brown, has consistenly been that this is pushing the use of dev_dbg
>>> too far.
>>
>>
>> In basically all cases the outcome is going to be dev_err(). dev_dbg()
>> is here to help keep DSP-recovery scenario viewer-friendly when checking
>> dmesg. Ideally, there should be no DSP-recoveries to begin with : )
> 
> I will refer you to this thread:
> 
> https://lore.kernel.org/alsa-devel/YGX5AUQi41z52xk8@kroah.com/

That's an interesting lecture, thanks for sharing the link.

Most of the time, we do want to dump an dev_err() if function fails for 
non-trivial reason. During recovery scenario though, we force-disconnect 
all the streams before attempting DSP reboot. That results in "wall of 
red" i.e. error messages. Since we know that all these errors are caused 
by the disconnection of the streams, there is no real value for flagging 
them as errors. It's debug-friendly (for a developer addressing the 
possible problem) to have only important marked as errors in dmesg.

Also, avs_ipc_err() has a very specific purpose and is used only by IPC 
handlers and nowhere else.

>>
>>>> +#define AVS_IPC_TIMEOUT_MS    300
>>>
>>> skl-sst-ipc.c:#define IPC_TIMEOUT_MSECS         3000
>>>
>>> that's one order of magniture lower. please add a comment or align.
>>>
>>>> +static void avs_dsp_receive_rx(struct avs_dev *adev, u64 header)
>>>> +{
>>>> +    struct avs_ipc *ipc = adev->ipc;
>>>> +    union avs_reply_msg msg = AVS_MSG(header);
>>>> +
>>>> +    ipc->rx.header = header;
>>>> +    if (!msg.status)
>>>> +        memcpy_fromio(ipc->rx.data, avs_uplink_addr(adev),
>>>> +                  ipc->rx.size);
>>>
>>> it wouldn't hurt to describe that the status determines whether
>>> additional information can be read from a mailbox.
>>
>>
>> Isn't that consisted with the behaviour of typical API function? Do not
>> copy memory and return it to the caller if something went wrong along
>> the way?
> 
> oh, I thought this was a case where the header contains all the
> information, and only in specific cases you need to read stuff from the
> mailbox.
> 
> You definitively need to add a comment on whether this is an error
> handling or a feature.

Ack.

>>>> +void avs_dsp_process_response(struct avs_dev *adev, u64 header)
>>>> +{
>>>> +    struct avs_ipc *ipc = adev->ipc;
>>>> +
>>>> +    if (avs_msg_is_reply(header)) {
>>>
>>> the naming is confusing, it's difficult for me to understand that a
>>> 'response' could not be a 'reply'. The two terms are synonyms, aren't
>>> they?
>>
>>
>> Those two are not the same from the firmware's point of view and thus
>> they are not the same here. Response is either a reply or a
>> notification. Replies are solicited, a request has been sent beforehand.
>> Notifications are unsolicited, you are not sure when exactly and if at
>> all they arrive.
> 
> add a comment then.

Ack.

>> Just so I'm not called a heretic later: yes, from English dictionary
>> point of view these two words are synonyms. In general, wording found in
>> this drivers matches firmware equivalents wherever possible to allow
>> developers to switch between these two worlds with minimal adaptation
>> period possible.
> 
>>
>>>> +    /* DSP acked host's request */
>>>> +    if (hipc_ack & spec->hipc_ack_done_mask) {
>>>> +        /* mask done interrupt */
>>>> +        snd_hdac_adsp_updatel(adev, spec->hipc_ctl_offset,
>>>> +                      AVS_ADSP_HIPCCTL_DONE, 0);
>>>> +
>>>> +        complete(&ipc->done_completion);
>>>> +
>>>> +        /* tell DSP it has our attention */
>>>> +        snd_hdac_adsp_updatel(adev, spec->hipc_ack_offset,
>>>> +                      spec->hipc_ack_done_mask,
>>>> +                      spec->hipc_ack_done_mask);
>>>> +        /* unmask done interrupt */
>>>> +        snd_hdac_adsp_updatel(adev, spec->hipc_ctl_offset,
>>>> +                      AVS_ADSP_HIPCCTL_DONE,
>>>> +                      AVS_ADSP_HIPCCTL_DONE);
>>>
>>> does the order between the complete() and the next two register updates
>>> matter?
>>>
>>> I would have updated the registers immediately and signal the completion
>>> later.
>>>
>>> I am also not sure why it's necessary to mask the done interrupt then
>>> unmask it. There is nothing that seems to require this masking?
>>>
>>> Or are you expecting the code blocked on wait_for_completion to be
>>> handled with interrupts masked, which could be quite racy?
>>
>>
>> Given how the things turned out in cAVS, some steps are not always
>> required. Here, we have very strict implementation and so interrupt are
>> masked.
>>
>> I'm unsure if relocating complete() to the bottom would bring any
>> consequences. And no, the code waiting_for_completion is not expecting
>> interrupts to be masked as there is no reply for ROM messages.
> 
> it would be just fine to add that the masking is added as an extra
> precaution, the order does not matter and the code executed after the
> complete() does not assume any masking.

Ack.

>>
>>>> +        ret = IRQ_HANDLED;
>>>> +    }
>>>> +
>>>> +    /* DSP sent new response to process */
>>>> +    if (hipc_rsp & spec->hipc_rsp_busy_mask) {
>>>> +        /* mask busy interrupt */
>>>> +        snd_hdac_adsp_updatel(adev, spec->hipc_ctl_offset,
>>>> +                      AVS_ADSP_HIPCCTL_BUSY, 0);
>>>> +
>>>> +        ret = IRQ_WAKE_THREAD;
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>
>>>> +static int avs_ipc_wait_busy_completion(struct avs_ipc *ipc, int
>>>> timeout)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +again:
>>>> +    ret = wait_for_completion_timeout(&ipc->busy_completion,
>>>> +                      msecs_to_jiffies(timeout));
>>>> +    /*
>>>> +     * DSP could be unresponsive at this point e.g. manifested by
>>>> +     * EXCEPTION_CAUGHT notification. If so, no point in continuing.
>>>
>>> EXCEPTION_CAUGHT doesn't seem to be described in this patchset, so not
>>> sure what this comment might mean.
>>
>>
>> Comment describes the circumstances for the communication failures and
>> arrival of EXCEPTION_CAUGHT notification is one of them.
> 
> that detail is unnecessary for reviewers.

Ack.

>>
>>>> +     */
>>>> +    if (!ipc->ready)
>>>> +        return -EPERM;
>>>> +
>>>> +    if (!ret) {
>>>> +        if (!avs_ipc_is_busy(ipc))
>>>> +            return -ETIMEDOUT;
>>>> +        /*
>>>> +         * Firmware did its job, either notification or reply
>>>> +         * has been received - now wait until it's processed.
>>>> +         */
>>>> +        wait_for_completion_killable(&ipc->busy_completion);
>>>
>>> can you elaborate on why wait_for_completion() is not enough? I haven't
>>> seen the 'killable' attribute been used by anyone in sound/
>>
>>
>> This is connected to how firmware handles messaging i.e. via queue. you
>> may get BUSY interrupt caused by a notification while waiting for the
>> reply for your request. Being 'disturbed' by a notification does not
>> mean firmware is dead, it's just busy and so we wait until previous
>> response is processed entirely.
> 
> this does not clarify why 'killable' is necessary?

Usage of 'killable' variant adheres to its documentation. Sys calls can 
terminate the waiter. More user friendly.

>>
>>>> +    }
>>>> +
>>>> +    /* Ongoing notification's bottom-half may cause early wakeup */
>>>> +    spin_lock(&ipc->rx_lock);
>>>> +    if (!ipc->rx_completed) {
>>>> +        /* Reply delayed due to notification. */
>>>> +        reinit_completion(&ipc->busy_completion);
>>>> +        spin_unlock(&ipc->rx_lock);
>>>> +        goto again;
>>>
>>> shouldn't there be some counter to avoid potential infinite loops here?
>>
>>
>> This is not a bad idea although the counter is going to be high e.g.:
>> 128. With DEBUG-level logs enabled you can get ton of messages before
>> your reply gets finally sent.
> 
> dev_dbg() in interrupts is usually not very helpful. we're trying to
> move to traces instead.

Wasn't precise enough, I appologize for that. By "DEBUG-level logs" I 
meant firmware logging, not dev_dbg() on kernel side. When enabled with 
log level DEBUG, you will get at least 1 message per sys tick, resulting 
in gigabyte logs in no time.

>>
>>>> +    }
>>>> +
>>>> +    spin_unlock(&ipc->rx_lock);
>>>> +    return 0;
>>>> +}
>>>
>>>> +static int avs_dsp_do_send_msg(struct avs_dev *adev, struct
>>>> avs_ipc_msg *request,
>>>> +                   struct avs_ipc_msg *reply, int timeout)
>>>> +{
>>>> +    struct avs_ipc *ipc = adev->ipc;
>>>> +    int ret;
>>>> +
>>>> +    if (!ipc->ready)
>>>> +        return -EPERM;
>>>> +
>>>> +    mutex_lock(&ipc->msg_mutex);
>>>> +
>>>> +    spin_lock(&ipc->rx_lock);
>>>> +    avs_ipc_msg_init(ipc, reply);
>>>> +    avs_dsp_send_tx(adev, request);
>>>> +    spin_unlock(&ipc->rx_lock);
>>>> +
>>>> +    ret = avs_ipc_wait_busy_completion(ipc, timeout);
>>>> +    if (ret) {
>>>> +        if (ret == -ETIMEDOUT) {
>>>> +            dev_crit(adev->dev, "communication severed: %d,
>>>> rebooting dsp..\n",
>>>> +                 ret);
>>>
>>> dev_crit() seems over the top if there is a recovery mechanism
>>
>>
>> There is just one dev_crit() within entire driver and it's there for a
>> reason - communication failure is critical and in practice, should never
>> occur in any scenario on the production hardware.
> 
> git grep dev_crit shows mostly this being used for temperature and
> shorts in codec drivers. that seems more 'critical' than a communication
> failure that likely does not result in spontaneous combustion.

Few dev_crit()s can also be found in other components as well.

Without audio and graphics there is no real 'user experience'. Abrupt 
closure of communication between DSP firmware and kernel driver can, and 
usually is a consequence of either an undefined behaviour (in process 
running on DSP) or hardware issue. While I can't spare the details here 
for obvious reasons, not all situations can even be recovered with 
reboot. That usually depends in which power wells registers reside. The 
100% confirmed solution for laptops is removing battery for a second or 
day to force G3.

Considering this, I believe having a single dev_crit() here is justified.

Regards,
Czarek