[Sound-open-firmware] cache invalidation evicting wrong lines

Guennadi Liakhovetski guennadi.liakhovetski at linux.intel.com
Fri Mar 15 17:56:24 CET 2019


On Fri, Mar 15, 2019 at 03:59:41PM +0000, Liam Girdwood wrote:
> On Fri, 2019-03-15 at 13:29 +0100, Guennadi Liakhovetski wrote:
> > On Fri, Mar 15, 2019 at 12:04:20PM +0000, Liam Girdwood wrote:
> > > On Fri, 2019-03-15 at 12:47 +0100, Guennadi Liakhovetski wrote:
> > > > On Fri, Mar 15, 2019 at 10:30:30AM +0000, Liam Girdwood wrote:
> > > > > On Fri, 2019-03-15 at 10:20 +0100, Guennadi Liakhovetski wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > Working on https://github.com/thesofproject/sof/pull/1053 I'm making
> > > > > > sure
> > > > > > all the DSP cores see the same data, which on Xtensa seems to require
> > > > > > manual cache synchronisation. 
> > > > > 
> > > > > Fwiw, this depends on the ISA and xtensa IP configuration (and this can
> > > > > vary
> > > > > between xtensa devices since the ISA is extremely configurable). This
> > > > > may
> > > > > also
> > > > > be true for other DSP ISAs too. 
> > > > 
> > > > So not all configurations require manual cache sync? Do we know which ones
> > > > do
> > > > and which ones don't?
> > > 
> > > Defined for each ISA, please see arch/cache.h. Client code does not need to
> > > know
> > > the cache config.
> > > 
> > > > > > This is a painful task, it's actually even
> > > > > > only practically possible because there aren't many objects in SOF
> > > > > > that
> > > > > > can be accessed by different cores. However, with that PR interrupt
> > > > > > descriptors will also become accessible by multiple cores.
> > > > > > 
> > > > > > On Sue Creek there are two kinds of cascading interrupt controllers:
> > > > > > internal "level2" - "level5" and a DesignWare interrupt controller IP,
> > > > > > used by several peripherals, uncliding the UART. With the addition of
> > > > > > manual cache synchronisation UART interrupts stopped working.
> > > > > > Debugging
> > > > > > showed, that DW INTC interrupt descriptors lose their updates when
> > > > > > other descriptors have their cache invalidated. And those descriptors
> > > > > > aren't even immediately close to each other in RAM. Roughly the
> > > > > > following is happening:
> > > > > 
> > > > > The DW- IRQ controller IP address space is non cacheable (since it's HW)
> > > > > so
> > > > > you
> > > > > shouldn't need to invalidate any access to it. 
> > > > 
> > > > Sorry, I didn't mean physical DW INTC registers, I meant a descriptor in
> > > > software, in RAM.
> > > > 
> > > > > > dw_intc->irq = new_irq;
> > > > > > list_for_item(list, descriptors) {
> > > > > > 	desc = container_of(list, struct irq_desc, list);
> > > > > > 	dcache_invalidate_region(desc, sizeof(*desc));
> > > > > > 
> > > > > > at which point dw_intc->irq loses its new_irq value and restores the
> > > > > > initial -EINVAL value. If I either remove the invalidation line above
> > > > > > or add a writeback for dw_intc between the first two lines, the
> > > > > > problem
> > > > > > goes away.
> > > > > 
> > > > > So iiuc dw_intc->irq is being trashed when you invalidate desc ?
> > > > 
> > > > Yes,
> > > 
> > > Btw, have you tried both gcc and xt-xcc ? Also have you tried adding a
> > > memory
> > > barrier here (setting dw_intc volatile should do this) ?
> > 
> > No, only gcc. I thought about trying volatile, but I couldn't convince myself,
> > that that should be the correct solution.
> 
> Can you try with xt-xcc and volatile. This will rule out any compiler
> shenanigans.

If I have a working st-scc install... I'll try next week.

> > > > > > Therefore questions:
> > > > > > 
> > > > > > 1. am I using cache invalidation wrongly?
> > > > > > 2. if the invalidated memory region isn't cache-line aligned, what
> > > > > > happens
> > > > > >    then? Are all full lines invalidated, thus potentially erasing
> > > > > > data?
> > > > > >    But even if this were handled wrongly, it wouldn't cause this
> > > > > > problem,
> > > > > >    because those descriptors aren't adjacent in RAM.
> > > > > > 3. is that a software or a hardware bug?
> > > > > > 4. how to fix?
> > > > > > 
> > > > > 
> > > > > More questions
> > > > > 
> > > > > 1) Are other dw_intc-> members trashed ? How is this structure/member
> > > > > aligned in
> > > > > memory ?
> > > > 
> > > > Not in my test, because other members were initialised statically and only
> > > > .irq
> > > > was assigned before scanning the descriptor list and invalidating caches.
> > > > I
> > > > don't
> > > > impose any alignment so far. The DW INTC descriptor was at 0xbe046aa8 and
> > > > the
> > > > descriptor, whose invalidation caused the trouble was at 0xbe04640c. .irq
> > > > is
> > > > in
> > > > fact the first element in the struct.
> > > > 
> > > 
> > > ok, so both not aligned on 64 bytes and not on the same cache line
> > > 
> > > 
> > > > > 2) Has descriptors been invalidated ? i.e. are you invalidating a valid
> > > > > address
> > > > > for desc ?
> > > > 
> > > > Yes, addresses are valid.
> > > > 
> > > > > 3) If desc and sizeof(*desc) are not cache line/size alligned, then try
> > > > > and
> > > > > make
> > > > > sure that size is rounded up to nearest cache line and desc rounded down
> > > > > to
> > > > > nearest cache line for alignment.
> > > > 
> > > > If this is the problem, this would indicate a hardware bug? I thought
> > > > about
> > > > trying
> > > > this but according to our understanding this shouuldn't be necessary?
> > > 
> > > No HW bug, but HW can only invalidate/writeback in blocks of cacheline size
> > > with
> > > cacheline alignment. Currently the code is not checking this and this may be
> > > "undefined" at ISA level. Can you check xthal_dcache_region_writeback().
> > 
> > Right, what I suspected. So, invalidating non-cache-aligned data leads to
> > memory
> > around the desired area to also lose the cache contents, possibly leading to
> > data loss.
> > 
> > > > > 4) Is this code running on core 0 (the same core that registered all the
> > > > > IRQs) ?
> > > > 
> > > > So far Sue Creek doesn't enable SMP, so, yes.
> > > 
> > > Ok, so to confirm there is also no other IP (like DMA) that can write to
> > > L2/SRAM
> > > except core 0 ? and can you provide the whole function where this fails.
> > 
> > There is DMA, but later. This happens during initialisation, before DMA. And
> > DMA
> > doesn't touch those areas anyway. The beginning of the function, where this
> > happens
> > is below.
> > 
> 
> Ok, so is this during IRQ core init or IRQ client code init ?

during IRQ core and platform init.

> > Thanks
> > Guennadi
> > 
> > int interrupt_cascade_register(struct irq_cascade_desc *cascade)
> > {
> > 	struct irq_desc *desc = &cascade->desc;
> > 	struct irq_cascade_desc *p_cascade;
> > 	struct list_item *list;
> > 	unsigned long flags, i;
> > 	int ret;
> > 
> > 	if (!cascade->name || !cascade->ops)
> > 		return -EINVAL;
> > 
> > 	spinlock_init(&cascade->lock);
> > 	for (i = 0; i < PLATFORM_IRQ_CHILDREN; i++)
> > 		list_init(&cascade->child[i].list);
> > 
> > 	spin_lock_irq(&cascade_lock, flags);
> 
> Cascade lock doesn't need invalidating since it's an SMP atomic op.
> 
> > 
> > 	list_for_item_smp (list, &cascade_list) {
> 
> Could the list be corrupt ? I cant see where cacscade_list is invalidated ?

list_for_item_smp does that:

#define list_for_item_smp(item, list) \
	dcache_invalidate_region(list, sizeof(*(list))); \
	for (item = (list)->next, dcache_invalidate_region(item, \
							   sizeof(*(item))); \
	     item != list; \
	     item = (item)->next, dcache_invalidate_region(item, \
							   sizeof(*(item))))

> > 		struct irq_cascade_desc *c = container_of(list,
> > 						struct irq_cascade_desc, list);
> > 
> > // One of these kills "cascade"
> 
> Is cascade killed on the first list iteration or a subsequent iteration. Is this
> iteration random or always consistent. Does this bug location change with debug
> code added ?

It seemed to be consistent. First 4 "level2" - "level5" cascading controllers are
added, all is good then. Then 2 INTC controllers are added - for the low and the
high 32 bits respectively. When the first one of those is added and when the list
of already installed controllers is checked, on a specific one of them the .irq
gets invalidated.

Thanks
Guennadi

> Liam
> 
> > 		dcache_invalidate_region(c, sizeof(*c));
> > 		if (!rstrcmp(c->name, cascade->name)) {
> > 			ret = -EEXIST;
> > 			goto unlock;
> > 		}
> > 	}
> > 
> > _______________________________________________
> > Sound-open-firmware mailing list
> > Sound-open-firmware at alsa-project.org
> > https://mailman.alsa-project.org/mailman/listinfo/sound-open-firmware
> 


More information about the Sound-open-firmware mailing list