Threads are dying, I cannot find the issue :(

ChibiOS public support forum for topics related to the STMicroelectronics STM32 family of micro-controllers.

colin
Posts: 149
Joined: Thu Dec 22, 2011 7:44 pm

Postby colin » Thu Mar 05, 2015 12:06 am

russian wrote: Now I am less sure. Something funny is going on:

My firmware is stuck, I am pausing and resuming execution. Note how current thread changes, but p_current is always the same?

Image
Image
Image

#define ON_UNLOCK_HOOK onUnlockHook()
#define dbg_leave_lock() {dbg_lock_cnt = 0;ON_UNLOCK_HOOK;}

void onUnlockHook(void) {
   uint64_t t = getTimeNowNt() - lastLockTime;
   if (t > maxLockTime) {
      maxLockTime = t;
   }
}

void onUnlockHook(void) {
 801b520:   b580         push   {r7, lr}
 801b522:   b082         sub   sp, #8
 801b524:   af00         add   r7, sp, #0
   uint64_t t = getTimeNowNt() - lastLockTime;
 801b526:   f003 fa0b    bl   801e940 <getTimeNowNt>
 801b52a:   f24f 5360    movw   r3, #62816   ; 0xf560
 801b52e:   f2c2 0300    movt   r3, #8192   ; 0x2000
 801b532:   e9d3 2300    ldrd   r2, r3, [r3]
 801b536:   1a82         subs   r2, r0, r2
 801b538:   eb61 0303    sbc.w   r3, r1, r3
 801b53c:   e9c7 2300    strd   r2, r3, [r7]
   if (t > maxLockTime) {
 801b540:   f24f 5368    movw   r3, #62824   ; 0xf568
 801b544:   f2c2 0300    movt   r3, #8192   ; 0x2000
 801b548:   681b         ldr   r3, [r3, #0]
 801b54a:   4618         mov   r0, r3
 801b54c:   f04f 0100    mov.w   r1, #0
 801b550:   e9d7 2300    ldrd   r2, r3, [r7]
 801b554:   4299         cmp   r1, r3
...
...
...



Perhaps there is something wrong with the lock-free code. But I think the alternative code that uses lockAnyContext() might have a problem too... I could be wrong, but as I try to follow the code, it seems like the following could be happening:

Some ChibiOS code (the ISR epilogue or _port_switch_from_isr) calls dbg_check_unlock(). This calls dbg_leave_lock(), which is a macro that expands to {dbg_lock_cnt = 0;ON_UNLOCK_HOOK;}. chconf.h defines ON_UNLOCK_HOOK to call onUnlockHook(). error_handling.cpp defines onUnlockHook(), which calls getTimeNowNt(). getTimeNowNt() is defined in engine_controller.cpp and calls Overflow64Counter::get(), which is defined in efilib2.cpp and calls lockAnyContext(). The lockAnyContext() function is defined in console_io.c, and it calls isLocked(), which checks whether dbg_lock_cnt is greater than zero. It is zero, however, because the dbg_leave_lock() macro has already assigned dbg_lock_cnt = 0 before invoking the hook. So the code takes the lock based on whether it is in ISR context: if (dbg_isr_cnt > 0) chSysLockFromIsr(); else chSysLock().

I don't have a good enough understanding of ChibiOS to know whether the above sequence is acceptable and valid. But I think the next part is the problem. Since dbg_lock_cnt == 0, Overflow64Counter::get() will have alreadyLocked == false, and it will call unlockAnyContext() before returning. But that causes a recursive call into the hook! It goes like this: unlockAnyContext() -> chSysUnlockFromIsr() -> dbg_check_unlock_from_isr() -> dbg_leave_lock() -> {dbg_lock_cnt = 0;ON_UNLOCK_HOOK;} -> onUnlockHook() -> getTimeNowNt() -> Overflow64Counter::get() -> lockAnyContext() -> chSysLockFromIsr(), then unlockAnyContext() -> chSysUnlockFromIsr() again -> infinite recursion.
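The call chain above can be sketched as a small host-side simulation. This is only an illustration under stated assumptions: the functions below are hypothetical stand-ins that reuse names from the post (sysLock(), sysUnlock() and counterGet() are made up for the sketch), none of it is the real ChibiOS or firmware code, and DEPTH_LIMIT exists only so the simulation stops instead of overflowing the stack.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the real ChibiOS or firmware code. */
#define DEPTH_LIMIT 8            /* stop the simulation before the stack overflows */

static int dbg_lock_cnt = 0;     /* mimics the ChibiOS debug lock counter */
static int hook_depth = 0;       /* current nesting of onUnlockHook() */
static int max_hook_depth = 0;   /* deepest nesting observed */

static void onUnlockHook(void);

static bool isLocked(void) { return dbg_lock_cnt > 0; }

static void sysLock(void) { dbg_lock_cnt = 1; }

/* Mirrors dbg_leave_lock() as quoted: counter cleared BEFORE the hook runs. */
static void sysUnlock(void) {
    dbg_lock_cnt = 0;
    onUnlockHook();
}

/* Mirrors the locking variant of Overflow64Counter::get() as described. */
static unsigned long counterGet(void) {
    bool alreadyLocked = isLocked();   /* false here: counter was just cleared */
    if (!alreadyLocked) sysLock();
    unsigned long value = 42;          /* placeholder for the real timer read */
    if (!alreadyLocked) sysUnlock();   /* re-enters onUnlockHook() */
    return value;
}

static void onUnlockHook(void) {
    hook_depth++;
    if (hook_depth > max_hook_depth) max_hook_depth = hook_depth;
    assert(hook_depth <= DEPTH_LIMIT);
    if (hook_depth < DEPTH_LIMIT) {
        (void)counterGet();            /* without the limit this never ends */
    }
    hook_depth--;
}

/* Performs one lock/unlock pair and reports how deep the hook nested. */
int simulate_unlock(void) {
    hook_depth = 0;
    max_hook_depth = 0;
    sysLock();
    sysUnlock();
    return max_hook_depth;
}
```

In this sketch a single unlock drives the hook all the way to DEPTH_LIMIT, which is the host-side equivalent of the unbounded recursion suspected above.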

Maybe I'm missing something, but it looks like that could be the case.

russian
Posts: 364
Joined: Mon Oct 29, 2012 3:17 am
Location: Jersey City, USA
Has thanked: 16 times
Been thanked: 14 times

Postby russian » Thu Mar 05, 2015 12:42 am

colin, you have a good point here. The version of Overflow64Counter.get with a critical section should probably not be used like that; the macro would have to be {ON_UNLOCK_HOOK;dbg_lock_cnt = 0;}. Just checked that into trunk, thank you!
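A minimal host-side sketch of why the reordered macro avoids the re-entry, assuming the hook and counter behave as described in this thread. All functions are hypothetical stand-ins (sysLock(), sysUnlock() and counterGet() are invented for the sketch), not the real ChibiOS or firmware code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the real ChibiOS or firmware code. */
static int dbg_lock_cnt = 0;   /* mimics the ChibiOS debug lock counter */
static int hook_calls = 0;     /* how many times the hook ran */

static void onUnlockHook(void);

static bool isLocked(void) { return dbg_lock_cnt > 0; }

static void sysLock(void) { dbg_lock_cnt = 1; }

/* Reordered dbg_leave_lock(): run the hook first, clear the counter last. */
static void sysUnlock(void) {
    onUnlockHook();
    dbg_lock_cnt = 0;
}

/* Same shape as the locking variant of Overflow64Counter::get(). */
static unsigned long counterGet(void) {
    bool alreadyLocked = isLocked();   /* true while the hook is running */
    if (!alreadyLocked) sysLock();
    unsigned long value = 42;          /* placeholder for the real timer read */
    if (!alreadyLocked) sysUnlock();   /* skipped when called from the hook */
    return value;
}

static void onUnlockHook(void) {
    hook_calls++;
    assert(isLocked());                /* hook still sees the lock as held */
    (void)counterGet();
}

/* Performs one lock/unlock pair and reports how often the hook ran. */
int count_hook_calls_on_unlock(void) {
    hook_calls = 0;
    sysLock();
    sysUnlock();
    return hook_calls;
}
```

Because the counter is still non-zero while the hook runs, counterGet() treats the system as already locked and never calls the unlock path again, so the hook runs exactly once per unlock in this sketch.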

On the other hand, I have already disabled the whole ON_*LOCK_HOOK thing altogether just to make things easier; I did not like seeing my code in r13.lr. It is still failing without my unlock hooks.

My gut feeling is that this is somehow about FPU operations: I've switched from an older Makefile to a more recent one, the only changes are a couple of parameters related to the FPU, and this has significantly reduced the probability of hanging under the same compiler.

I have also tried disabling some math-intense sections of the code, and it does not hang without them. Re-enabling any of the math-intense sections brings the issue back. Another argument to support my random theory that this is somehow related to the FPU is the fact that code compiled under GCC 4.9.2 or IAR has NEVER failed. My theory is that a functional bug in the code would fail under any compiler at least once.

Giovanni
Site Admin
Posts: 14455
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1076 times
Been thanked: 922 times

Postby Giovanni » Thu Mar 05, 2015 9:12 am

Assuming it is a problem with the FPU, could you create a math-intensive threaded program that fails?
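One possible shape for such a test, as a sketch only: each thread repeatedly runs a deterministic floating-point workload and compares the result bit for bit against a reference computed by the same code. If FPU state were being corrupted across a context switch or an interrupt, a mismatch would be expected sooner or later. The names fpu_workload() and fpu_self_check() are hypothetical, and the thread creation itself (for example one ChibiOS thread per seed) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Deterministic single-precision workload; needs no math library. */
static float fpu_workload(uint32_t seed, uint32_t iterations) {
    float acc = (float)(seed % 97u) + 1.0f;
    for (uint32_t i = 1; i <= iterations; i++) {
        float x = (float)i;
        acc = acc * 1.0001f + x / (x + 3.0f);
        if (acc > 1.0e6f) acc *= 1.0e-6f;   /* keep the value bounded */
    }
    return acc;
}

/* Recomputes the workload `rounds` times and counts results that differ
   bitwise from the first run. Intended to be called from several threads
   with different seeds; zero mismatches is the expected outcome. */
uint32_t fpu_self_check(uint32_t seed, uint32_t iterations, uint32_t rounds) {
    assert(rounds > 0u);
    /* volatile read stops the compiler from computing the result only once */
    volatile uint32_t vseed = seed;
    float reference = fpu_workload(vseed, iterations);
    uint32_t mismatches = 0;
    for (uint32_t r = 0; r < rounds; r++) {
        float again = fpu_workload(vseed, iterations);
        if (memcmp(&again, &reference, sizeof again) != 0) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A thread body could call fpu_self_check() in a loop and halt or set an error flag on the first non-zero return, which gives a clear point to inspect in the debugger.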

Giovanni

Postby russian » Fri Mar 06, 2015 5:02 pm

Giovanni wrote: Assuming it is a problem with the FPU, could you create a math-intensive threaded program that fails?

I am trying, but with no luck so far. Either more planets have to align to trigger the issue, or it's just a defect in my own firmware. It's now the third week of trying to figure it out :(

Postby Giovanni » Fri Mar 06, 2015 5:12 pm

Can you run with FPU disabled and see if it makes a difference?

Giovanni

Postby russian » Sun Mar 08, 2015 5:43 am

Giovanni wrote: Can you run with FPU disabled and see if it makes a difference?

Disabling the FPU might be making a difference:

FPU=softfp failed after 4080 cycles, FPU=no passed 8500 cycles. But that is only about twice the number of cycles it took to fail, so it does not prove much. I am now restarting it to run 22000 cycles, which should take about 16 hours.
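A rough way to quantify "does not prove much", as a sketch: assume failures occur at a constant rate with a mean of about 4080 cycles between them. That assumption is a loud one, since it rests on the single softfp failure reported here. Under it, the probability that a build with the same defect survives n cycles purely by luck is:

```latex
\begin{align}
P(\text{no failure in } n \text{ cycles}) &= e^{-n/\mu}, \qquad \mu \approx 4080 \\
P(8500)  &= e^{-8500/4080}  \approx e^{-2.08} \approx 0.12 \\
P(22000) &= e^{-22000/4080} \approx e^{-5.39} \approx 0.005
\end{align}
```

So under this assumption a clean 8500-cycle run would still happen by chance roughly one time in eight, while a clean 22000-cycle run would be much stronger evidence.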

Here is normal operation. I expect all my threads to be SLEEPING most of the time:
Image

And here it is when it's stuck:
Image

The CURRENT thread is constantly changing, but there is no real activity inside these threads: my status LED is not blinking, etc.
Image
Image

Postby Giovanni » Sun Mar 08, 2015 7:36 am

The threads all become ready, so they all take CPU time...

I cannot imagine what could cause this. Do you have an infinite loop somewhere? Try stopping the execution: what are those threads executing when the problem happens?

Giovanni

Postby russian » Sun Mar 08, 2015 1:17 pm

Every time I pause the execution I find myself in one of the IRQ handlers.

I did not put infinite loops anywhere on purpose, but who knows, maybe I have a defect. I would love to find that defect if only I knew how. By the way, the FPU=no version of the binary has passed 10143 cycles already, reinforcing my theory that this is FPU related.

Postby russian » Sun Mar 08, 2015 4:35 pm

Looks like it's spending all its time serving interrupts and never gets a chance to run the user threads.
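One low-cost way to see which handler is monopolizing the CPU is a per-vector hit counter that can be read from the debugger after the hang. The sketch below is an assumption-laden illustration, not existing firmware or ChibiOS code: irq_note(), irq_reset(), irq_busiest() and IRQ_SLOTS are hypothetical names, and each ISR would have to be edited by hand to call irq_note() with its own vector number.

```c
#include <assert.h>
#include <stdint.h>

#define IRQ_SLOTS 96u   /* hypothetical upper bound on vector numbers */

/* One counter per interrupt vector; inspect this array in the debugger. */
static volatile uint32_t irq_hits[IRQ_SLOTS];

/* Call at the top of each ISR with that ISR's vector number. */
void irq_note(uint32_t irqn) {
    assert(irqn < IRQ_SLOTS);
    irq_hits[irqn]++;
}

/* Clears all counters, e.g. right before reproducing the hang. */
void irq_reset(void) {
    for (uint32_t i = 0; i < IRQ_SLOTS; i++) {
        irq_hits[i] = 0;
    }
}

/* Returns the vector number with the most hits, or -1 if none fired. */
int irq_busiest(void) {
    int best = -1;
    uint32_t best_hits = 0;
    for (uint32_t i = 0; i < IRQ_SLOTS; i++) {
        uint32_t hits = irq_hits[i];
        if (hits > best_hits) {
            best_hits = hits;
            best = (int)i;
        }
    }
    return best;
}
```

Pausing twice a second or so apart and comparing the counters shows which vector is advancing fastest, which narrows the search to one peripheral.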

It's worth noting that I actually enable a LOT of interrupts: here is what my NVIC_BASE region looks like 3 seconds after start:

Image

I have a number of EXTI channels to support a six-position joystick, I have two ADC channels with DMA, I use CAN and I use multiple timer input capture.
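For reading a dump like the one above, a small helper can turn the NVIC set-enable words into a list of enabled IRQ numbers. This is a sketch that assumes the dumped words are the ISER registers (NVIC_ISER0 sits at 0xE000E100 on Cortex-M, one bit per IRQ, 32 IRQs per word); nvic_enabled_irqs() is a hypothetical helper and the caller supplies the words copied from the memory view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes `words` consecutive ISER values into IRQ numbers
   (IRQ = 32 * word index + bit index). Writes at most out_cap entries to
   `out` and returns the total number of enabled IRQs found. */
size_t nvic_enabled_irqs(const uint32_t *iser, size_t words,
                         uint8_t *out, size_t out_cap) {
    assert(iser != NULL && out != NULL);
    assert(words <= 8u);   /* keeps every IRQ number within uint8_t */
    size_t count = 0;
    for (size_t w = 0; w < words; w++) {
        for (unsigned bit = 0; bit < 32u; bit++) {
            if ((iser[w] >> bit) & 1u) {
                if (count < out_cap) {
                    out[count] = (uint8_t)(w * 32u + bit);
                }
                count++;
            }
        }
    }
    return count;
}
```

The resulting numbers can then be matched against the vector table in the device's reference manual to confirm that only the expected interrupts are enabled.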

Postby Giovanni » Sun Mar 08, 2015 4:41 pm

Then there must be a reason for the serial interrupt being continuously asserted even if the SR register has nothing inside. Could you inspect the USART SR register when the problem is triggered? Put a breakpoint at the start of the ISR like we did before.

Giovanni

