On Wed, Dec 05, 2018 at 10:50:29AM -0700, Daniel Kurtz wrote:
I agree, there is no guarantee here once things have gone wrong, and the concerns above are reasonable. However, in the real world, I2C transactions do sometimes fail for various reasons. The I2C (and
It's *vanishingly* rare and like I say we don't have any constructive recovery plans - I've never seen a practical system that had any better idea than hoping the user noticed a problem and reboots. Probably the best we can really do is try to reset the device, or at least resync the register map. That would probably do something useful in the vanishingly rare case where it were a singular glitch, at the cost of obvious and painful user visible issues (which would probably be happening anyway). The nearest I've seen to that is one of the CODEC drivers which has watchdog code to monitor the device in case it spontaneously reboots and will resync the register cache if that happened.
I'm gathering that there are Chromebooks that do have I/O problems here?
other bus) APIs have ways of reporting up their errors so callers can take appropriate action. This codec driver can run on all kinds of hardware that can experience transient I2C errors, thus it sounds like a reasonable idea to have the driver do some error checking on the APIs it calls and take whatever action it can. Just ignoring the errors and proceeding like nothing is wrong is one option, but we can probably do a little better by at least checking for errors, abort the current operation, and pass up errors to higher layers when an i2c transaction failure is detected. If nothing else, this would enable higher policy layers to take appropriate corrective action - for example, if there is an i2c error when configuring a codec, it seems advisable to report this up so that a machine driver would know to abort and not turn on downstream amplifiers [I am assuming here that something like this would happen, I don't know if the sound stack really works this way].
Yeah, so this is partly why I'm not super thrilled here - none of the layers currently have any especially bright ideas about how to handle things and giving up part way through an operation and returning an error without any attempt to unwind or recover feels like it's just trying to give a false sense of security.
Once the default "check, abort and report" error checking is in place, we could perhaps think about actions that the driver could take to recover from various failures, such as resetting the device or unwinding previous transactions before aborting, or retrying.... but those are all policy, and this patch is more mechanism that enables policy.
But is this going to be a useful way of handling such policy and is any work on that from whoever has these unreliable systems likely to be forthcoming? We're going to end up with partially done reconfigurations in the register cache which is a potential issue for recovery strategies based around resyncing that, it's not so bad on things like starting a stream but bias level reconfigurations could be fun.