Preventing interrupt storms on STM32

tridge
Posts: 141
Joined: Mon Sep 25, 2017 8:27 am
Location: Canberra, Australia
Has thanked: 10 times
Been thanked: 20 times
Contact:

Postby tridge » Sat Aug 24, 2019 1:22 am

Continuing a recent theme of trying to make ArduPilot on ChibiOS as robust as we can, I've created a patch to prevent interrupt storms on I2C on STM32.
This was prompted by a particularly nasty bug report where a small quadcopter called a 'Solo' has occasionally fallen out of the sky. After a few months of debugging we think the likely cause is an I2C interrupt storm, where we get an interrupt from an I2C peripheral which we either don't acknowledge or where the ack isn't accepted due to corrupted state inside the I2C peripheral. We don't have solid evidence that this is the cause (we have never managed to reproduce the problem when the vehicle isn't actually flying, which makes it hard to debug), but we've eliminated other likely causes such as power failures, hard faults, locking errors etc. We use the IWDG to catch these types of failures, keeping critical data in the backup registers, which allows us to narrow down causes. We have seen interrupt storms cause these symptoms on NuttX previously.
You can see the patch I've done to handle this here:
https://github.com/ArduPilot/ChibiOS/pull/16/files
What it does is to allow for a limit of interrupts per byte. If that limit is reached then we use the RCC to reset the peripheral and fail the I2C transaction. To test this I added a deliberate bug where we failed to acknowledge interrupts after the board had been running for a while, and the change does correctly reset the peripheral and fail the transaction. It also recovers and the I2C peripheral is usable again after the interrupt storm is squashed.
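For anyone who doesn't want to read the diff, the interrupts-per-byte limit can be sketched roughly as below. This is not the code from the pull request: the irq_guard_* names are hypothetical, scaling the budget by the transfer length is an assumption, and the RCC reset and transaction failure are only indicated in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-transfer guard; names are illustrative, not from the patch. */
typedef struct {
    uint32_t budget;   /* interrupts still allowed for this transfer */
    bool     tripped;  /* set once the budget has been exhausted */
} irq_guard_t;

/* Call when a transfer starts; nbytes is the tx + rx length.
 * Assumption: the budget is limit_per_byte times the number of bytes. */
static inline void irq_guard_start(irq_guard_t *g, uint32_t limit_per_byte,
                                   uint32_t nbytes)
{
    assert(limit_per_byte > 0);
    g->budget = limit_per_byte * (nbytes > 0 ? nbytes : 1);
    g->tripped = false;
}

/* Call at the top of the ISR. Returns true while the ISR may proceed.
 * On false, a real driver would reset the peripheral (e.g. via the RCC)
 * and fail the transaction instead of servicing the interrupt. */
static inline bool irq_guard_tick(irq_guard_t *g)
{
    if (g->tripped) {
        return false;
    }
    if (g->budget == 0) {
        g->tripped = true;
        return false;
    }
    g->budget--;
    return true;
}
```

For example, with a limit of 6 interrupts per byte and an 11 byte transfer, the first 66 calls to irq_guard_tick() return true and the 67th trips the guard.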
I'm not really expecting this to be merged into mainline ChibiOS, but I thought I'd post it here to get comments on the approach and in case anyone else is looking for a solution to possible interrupt storms.
When we saw this issue previously on NuttX we could reproduce it by using a 1m long I2C cable wrapped around an electric drill to inject noise into the bus. That setup reliably killed NuttX fairly quickly with an I2C interrupt storm. Unfortunately that same setup does not trigger the issue on ChibiOS, which may mean this issue just doesn't exist on ChibiOS, but we'll likely apply this change anyway as it is our best guess as to the cause of the lockups, and it should be harmless if the peripheral is operating normally.
Comments welcome!
Cheers, Tridge

Giovanni
Site Admin
Posts: 14455
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1076 times
Been thanked: 922 times
Contact:

Re: Preventing interrupt storms on STM32

Postby Giovanni » Sat Aug 24, 2019 5:56 am

Hi,

Interesting, it is limited to the LLD and optional so merging is possible after verifying details.

It looks like an HW behavior, do you know which event or state triggers it? Which IRQ source is doing this?

Moving in bug reports.

Giovanni

Re: Preventing interrupt storms on STM32

Postby tridge » Sat Aug 24, 2019 6:37 am

Giovanni wrote: It looks like an HW behavior, do you know which event or state triggers it? Which IRQ source is doing this?

No, we don't. If this patch does fix the issue then that would give us the opportunity to log the ISR mask when it happens. As we never reproduced it outside flight conditions we had no way to debug it properly: when it happens the aircraft falls out of the sky immediately and we don't get a chance to attach a debugger.
I can tell you what happened back in 2014 when we had this happen on NuttX. This is what I wrote at the time:

https://github.com/ArduPilot/PX4NuttX/b ... 2c.c#L1271

if (status & I2C_SR1_TXE) {
    /* this should never happen, but it does happen
       occasionally with lots of noise on the bus. It means the
       peripheral is expecting more data bytes, but we don't have any to give.
       This has been seen with status=0x70084, reproduced with
       noise generated by a Jabra wireless headset in close
       proximity to the I2C lines
    */

and this:

} else if (status & I2C_SR1_STOPF) {
    /*
       we should never get this, as we are a master not a
       slave. Write CR1 with its current value to clear the
       error
    */

I'd actually forgotten till now that I'd reproduced that first one by playing music :-)

We patched ChibiOS to cope with I2C_SR1_TXE, I2C_SR1_BTF and I2C_SR1_STOPF when we first ported ArduPilot to ChibiOS, as we knew from the NuttX experience that those could happen unexpectedly.
I only thought of doing a general purpose fix for unexpected I2C interrupts recently, when we basically got frustrated with trying to track down this issue. Using an RCC reset really helped with that SPI issue recently, so I thought we could apply a similar fix for I2C.
I will also look at extending this to interrupt storms on SPI and CAN. Basically we can't trust that STM32 peripherals will always work as described in the datasheet, especially if you mistreat them by blasting them with unreasonable quantities of noise. UAVs are prone to that sort of thing.
Cheers, Tridge

Re: Preventing interrupt storms on STM32

Postby Giovanni » Sat Aug 24, 2019 6:42 am

I2C continues to prove troublesome on STM32...

Have you considered trying the SW fallback I2C implementation? It could prove safer after some tests.

Giovanni

alex31
Posts: 379
Joined: Fri May 25, 2012 10:23 am
Location: toulouse, france
Has thanked: 38 times
Been thanked: 62 times
Contact:

Re: Preventing interrupt storms on STM32

Postby alex31 » Sat Aug 24, 2019 11:43 am

Hello,

As a UAV developer, I can confirm that the electrical and electromagnetic environment is very harsh. Things that work well on the bench suddenly behave strangely when flying.

In this context I welcome any software guard that can trap hardware faults, and vote for inclusion of tridge's I2C and SPI patches in trunk.

Alexandre

Re: Preventing interrupt storms on STM32

Postby tridge » Sat Aug 24, 2019 11:52 am

Giovanni wrote: Have you considered trying the SW fallback I2C implementation? it could prove safer after some tests.

I haven't tried it, but I would guess it would have too much of a performance impact.
I'm actually quite hopeful that, with interrupt storms prevented, I2C will no longer be a source of major issues like this.
Cheers, Tridge

Re: Preventing interrupt storms on STM32

Postby Giovanni » Sat Aug 24, 2019 12:45 pm

I like the idea, I could look into creating some generic mechanism for IRQ spam protection to be used in drivers. I am less sure about the error handling method, resetting everything looks like a brute force approach.

It would be important to understand the causes and which IRQ source gets stuck.

Giovanni

mikeprotts
Posts: 166
Joined: Wed Jan 09, 2019 12:37 pm
Has thanked: 19 times
Been thanked: 31 times

Re: Preventing interrupt storms on STM32

Postby mikeprotts » Sat Aug 24, 2019 4:25 pm

I'd consider an option to reset on error as appropriate. The most logical idea would be an error callback (let the programmer decide), but only if the board is still able to process it.
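One possible shape for such a callback, as a minimal sketch only. Nothing here is an existing ChibiOS API: the storm_* names, the two actions and the example callback are all hypothetical. The callback would run in interrupt context, so it has to stay short, and the dispatcher falls back to the reset when no callback is registered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical actions the application can request when a storm is detected. */
typedef enum {
    STORM_RESET_PERIPHERAL,  /* reset the peripheral and fail the transfer */
    STORM_DISABLE_IRQ        /* mask the interrupt and leave the rest to the app */
} storm_action_t;

/* Hypothetical callback type: receives the status word seen at trip time. */
typedef storm_action_t (*storm_cb_t)(uint32_t status, void *arg);

/* Example callback: remember the status for later logging, then ask for a reset. */
static inline storm_action_t storm_log_and_reset(uint32_t status, void *arg)
{
    assert(arg != NULL);
    *(uint32_t *)arg = status;
    return STORM_RESET_PERIPHERAL;
}

/* Called by the driver when the guard trips. With no callback registered
 * the driver keeps the brute-force default and resets the peripheral. */
static inline storm_action_t storm_dispatch(storm_cb_t cb, uint32_t status,
                                            void *arg)
{
    return (cb != NULL) ? cb(status, arg) : STORM_RESET_PERIPHERAL;
}
```

Keeping the default as a reset means a board that cannot run application code at that point still recovers.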

One way to attempt to reproduce it would be to run several signals of different frequencies over a ribbon cable, perhaps square waves or random pulses. My initial use of SPI suffered from crosstalk (easily visible with a logic analyser/oscilloscope), which frequently caused a lot of the data to be missed and gave plenty of false triggers.

Mike

Re: Preventing interrupt storms on STM32

Postby tridge » Sat Aug 24, 2019 10:40 pm

Giovanni wrote: I like the idea, I could look into creating some generic mechanism for IRQ spam protection to be used in drivers. I am less sure about the error handling method, resetting everything looks like a brute force approach.

It is brute force, and quite deliberately so. It only triggers if we've already tried acking the interrupt via whatever more specific methods are in the driver and it has failed. We should still add more specific ISR acks as we come across them. To get that we could add a field in the driver that gives us the ISR state when this triggers, so the app can log which interrupt it was failing to ack.
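A minimal sketch of what such a field could look like. The i2c_storm_* names are hypothetical and not part of the ChibiOS driver; the status word uses the SR1 | (SR2 << 16) packing that the status values quoted in this thread follow. On real hardware the fields would need to be volatile (or read with interrupts masked) since the ISR writes them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical diagnostic fields that a driver structure could carry. */
typedef struct {
    uint32_t storm_count;   /* how many times the storm guard has fired */
    uint32_t storm_status;  /* SR1 | (SR2 << 16) captured at the last trip */
} i2c_storm_diag_t;

/* Called from the ISR at the moment the guard trips, before the reset. */
static inline void i2c_storm_capture(i2c_storm_diag_t *d, uint16_t sr1,
                                     uint16_t sr2)
{
    assert(d != NULL);
    d->storm_status = (uint32_t)sr1 | ((uint32_t)sr2 << 16);
    d->storm_count++;
}

/* Called from the application. Returns true, and hands back the captured
 * status, if a new trip has happened since the caller last looked. */
static inline bool i2c_storm_poll(const i2c_storm_diag_t *d,
                                  uint32_t *seen_count, uint32_t *status_out)
{
    if (d->storm_count == *seen_count) {
        return false;
    }
    *seen_count = d->storm_count;
    *status_out = d->storm_status;
    return true;
}
```

The application can poll this from its normal logging loop, so nothing beyond two stores has to happen inside the interrupt handler.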
A generic anti-spam mechanism for IRQs would be great. Are you thinking you'd just disable the vector, or would you link the vector to the peripheral somehow and reset it?
Giovanni wrote: It would be important to understand the causes and which IRQ source gets stuck.

Yes, that would be great, if we had a way to capture it in a flying vehicle :-)
It is also a difficult-to-reproduce bug, even when flying. We first saw it a few months ago with a single incident, then we had a couple more, and then 5 in one day from one aircraft. Those 5 in a row were great for narrowing it down. Those 5 were on July 16th, and the issue hasn't happened again, even with exactly the same hardware and firmware version.

Re: Preventing interrupt storms on STM32

Postby tridge » Sun Aug 25, 2019 10:06 am

Hi Giovanni,
We've made some progress on the root cause of the problem.
After applying the patch we did some testing on a Solo with an SMBus smart battery and found that the limit we set of 6 interrupts per byte was being exceeded when we turned on the smart battery.
I added some recording of SR1/SR2 values in the interrupt handlers, and found this:

- we were doing a transfer sending 1 byte and receiving 10 bytes
- in total we got 87 interrupts for the transfer
- the ISR mask pattern was:

ISR[0] 0x30001
ISR[1] 0x70082
ISR[2] 0x70084
ISR[3] 0x70084
(repeated until the 84th interrupt)
ISR[84] 0x70084
ISR[85] 0x30001
ISR[86] 0x30002

the values above are SR1 | (SR2<<16)
so this shows the long series of interrupts was for BTF | TXE. It is interesting that the condition does clear itself after 84 interrupts, and the transaction succeeds.
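To make the packed values easier to read, here is a small decoder sketch. The function and table names are hypothetical, and the bit positions are the I2C_SR1/I2C_SR2 layout as I understand it from the STM32F4 reference manual, so check them against your part before relying on this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint16_t mask; const char *name; } flag_name_t;

/* Assumed STM32F4 I2C_SR1 flags, lowest bit first. */
static const flag_name_t sr1_flags[] = {
    {0x0001, "SB"},     {0x0002, "ADDR"},    {0x0004, "BTF"},
    {0x0008, "ADD10"},  {0x0010, "STOPF"},   {0x0040, "RXNE"},
    {0x0080, "TXE"},    {0x0100, "BERR"},    {0x0200, "ARLO"},
    {0x0400, "AF"},     {0x0800, "OVR"},     {0x1000, "PECERR"},
    {0x4000, "TIMEOUT"}, {0x8000, "SMBALERT"}
};

/* Assumed STM32F4 I2C_SR2 flags (the PEC byte in bits 15:8 is ignored). */
static const flag_name_t sr2_flags[] = {
    {0x0001, "MSL"},        {0x0002, "BUSY"},    {0x0004, "TRA"},
    {0x0010, "GENCALL"},    {0x0020, "SMBDEFAULT"},
    {0x0040, "SMBHOST"},    {0x0080, "DUALF"}
};

/* Append src to the NUL-terminated string in out, never exceeding len bytes. */
static inline void append(char *out, size_t len, const char *src)
{
    size_t used = strlen(out);
    if (used + 1 < len) {
        strncat(out, src, len - used - 1);
    }
}

/* Append the label, then the name of every flag set in value. */
static inline void append_flags(char *out, size_t len, const char *label,
                                uint16_t value, const flag_name_t *tbl,
                                size_t n)
{
    append(out, len, label);
    for (size_t i = 0; i < n; i++) {
        if (value & tbl[i].mask) {
            append(out, len, " ");
            append(out, len, tbl[i].name);
        }
    }
}

/* Decode a status word packed as SR1 | (SR2 << 16) into readable text. */
static inline void i2c_status_decode(uint32_t packed, char *out, size_t len)
{
    assert(out != NULL && len > 0);
    out[0] = '\0';
    append_flags(out, len, "SR1:", (uint16_t)(packed & 0xFFFFu),
                 sr1_flags, sizeof sr1_flags / sizeof sr1_flags[0]);
    append(out, len, " ");
    append_flags(out, len, "SR2:", (uint16_t)(packed >> 16),
                 sr2_flags, sizeof sr2_flags / sizeof sr2_flags[0]);
}
```

With these tables, 0x30001 decodes to "SR1: SB SR2: MSL BUSY", 0x70082 to "SR1: ADDR TXE SR2: MSL BUSY TRA", and the repeated 0x70084 to "SR1: BTF TXE SR2: MSL BUSY TRA".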
The number of interrupts wasn't consistent. Max I've seen is 139 interrupts for a 4 byte transfer (1 byte send, 3 byte receive). I presume the interrupts stop when the slave device completes its send (maybe it is clock stretching?).
Alternatively, could we be getting an issue with DMA priorities? The above test was with an STM32F427, and it was using DMA.

