
macWaitTransmitDescriptor issue  Topic is solved

Posted: Thu Jan 23, 2020 8:33 pm
by amendola
I'm using ChibiOS 16.1.9 and compiling for STM32F4 with arm-none-eabi-gcc.
I have around 10 threads including LWIP threads handling calls from the sequential-API (netconn_sendto() / netconn_recv()).

The issue I'm seeing is in the MAC layer. When I try to send multiple large Ethernet frames in close succession, the first two get transmitted, but the third ends up hanging in macWaitTransmitDescriptor(), inside osalThreadEnqueueTimeoutS(), for the full timeout (long after the first two packets have been transmitted and their transmit descriptors released by the DMA). The issue only happens once in a while. I have the driver configured with 2 transmit descriptors (STM32_MAC_TRANSMIT_BUFFERS = 2).

When the issue does NOT occur, the sequence of events is:
1. The tcp/ip thread puts ethernet frames into the two available transmit descriptors and sets them as 'owned by DMA'
2. While trying to send the third packet, the tcp/ip thread checks for an available transmit descriptor, and sees that there are none available (both still owned by DMA)
3. The tcp/ip thread is enqueued with a 50ms (LWIP_SEND_TIMEOUT) timeout, waiting to be woken up (at osalThreadEnqueueTimeoutS() inside macWaitTransmitDescriptor())
4. Soon after (much less than 50ms), the DMA finishes transmitting one or both of the enqueued packets and releases the transmit descriptor(s). This triggers the STM32_ETH_HANDLER ISR, which wakes up the tcp/ip thread. (This happens in os/hal/ports/STM32/LLD/MACv1/mac_lld.c)
5. The driver, freshly woken up, sees that the transmit descriptor(s) is now available, and it continues transmitting

However, there seems to be a race condition which allows step 4 to happen before step 3: the ISR runs after the TCP/IP thread determines that there are no free transmit descriptors, but before the TCP/IP thread enters osalThreadEnqueueTimeoutS(). Since the thread was not yet enqueued when the ISR tried to dequeue it, the wakeup is lost, and the TCP/IP thread stays enqueued until osalThreadEnqueueTimeoutS() times out.
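
To make the interleaving concrete, here is a toy single-threaded model of the lost wakeup. It is only a sketch, not ChibiOS code: model_t, thread_check(), thread_enqueue(), isr_tx_complete() and lost_wakeup() are hypothetical names made up for illustration, and the "ISR" is just a function called at a chosen point in the sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the driver state (hypothetical, not the real MACDriver). */
typedef struct {
    bool descriptor_free;  /* a transmit descriptor has been released */
    bool thread_enqueued;  /* the tcp/ip thread is parked on the wait queue */
    bool thread_woken;     /* the ISR found the thread and dequeued it */
} model_t;

/* Thread side, first half: poll for a free descriptor. */
static bool thread_check(const model_t *m) {
    return m->descriptor_free;
}

/* Thread side, second half: park on the wait queue. */
static void thread_enqueue(model_t *m) {
    assert(!m->thread_enqueued);
    m->thread_enqueued = true;
}

/* ISR side: the descriptor is released and whoever is queued NOW is woken. */
static void isr_tx_complete(model_t *m) {
    m->descriptor_free = true;
    if (m->thread_enqueued) {
        m->thread_enqueued = false;
        m->thread_woken = true;
    }
}

/* Returns true if the thread is left sleeping until its timeout even
 * though a descriptor is already free (the lost wakeup). */
bool lost_wakeup(bool isr_between_check_and_enqueue) {
    model_t m = { false, false, false };

    if (thread_check(&m)) {
        return false;                 /* descriptor available, no wait */
    }
    if (isr_between_check_and_enqueue) {
        isr_tx_complete(&m);          /* ISR sneaks in: nobody to wake yet */
    }
    thread_enqueue(&m);
    if (!isr_between_check_and_enqueue) {
        isr_tx_complete(&m);          /* normal order: ISR wakes the thread */
    }
    return m.thread_enqueued && m.descriptor_free && !m.thread_woken;
}
```

Calling lost_wakeup(true) reproduces the bad ordering (the thread stays enqueued with a free descriptor), while lost_wakeup(false) is the normal sequence listed above.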

The bug seems to be in the ChibiOS MAC driver. In /os/hal/src/mac.c : macWaitTransmitDescriptor(), the interrupt can be raised in between the call to mac_lld_get_transmit_descriptor() and the call to osalThreadEnqueueTimeoutS(). A proposed solution would be, in macWaitTransmitDescriptor(), to move the osalSysLock() before the while loop and to remove the syslocks/unlocks from the implementations of mac_lld_get_transmit_descriptor(). These changes prevent the ISR from sneaking in after the call to mac_lld_get_transmit_descriptor(), but before the call to osalSysLock().
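
The effect of taking the lock before the check can be sketched with the same kind of toy model. Again, this is not the real driver: model_t, dma_tx_complete(), isr_wake(), enqueue_and_sleep() and woken_with_lock_held() are hypothetical names. The model assumes that an interrupt raised while the lock is held stays pending, and is serviced only once the thread has been enqueued and the system switches away from it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the driver state (hypothetical, not the real MACDriver). */
typedef struct {
    bool locked;           /* stands in for the osalSysLock() critical section */
    bool irq_pending;      /* interrupt raised while locked, not yet serviced */
    bool descriptor_free;  /* a transmit descriptor has been released */
    bool thread_enqueued;  /* the tcp/ip thread is parked on the wait queue */
    bool thread_woken;     /* the ISR found the thread and dequeued it */
} model_t;

/* ISR body: wake whoever is queued at the moment it runs. */
static void isr_wake(model_t *m) {
    if (m->thread_enqueued) {
        m->thread_enqueued = false;
        m->thread_woken = true;
    }
}

/* Hardware side: the DMA releases a descriptor and raises the interrupt.
 * While the lock is held the ISR is deferred instead of running. */
static void dma_tx_complete(model_t *m) {
    m->descriptor_free = true;
    if (m->locked) {
        m->irq_pending = true;
    } else {
        isr_wake(m);
    }
}

/* Thread goes to sleep: it is enqueued first, and only afterwards are
 * pending interrupts serviced (assumption of this model). */
static void enqueue_and_sleep(model_t *m) {
    assert(m->locked);     /* must be called with the lock held */
    m->thread_enqueued = true;
    m->locked = false;
    if (m->irq_pending) {
        m->irq_pending = false;
        isr_wake(m);
    }
}

/* Lock is taken BEFORE the descriptor check; the interrupt then arrives at
 * the worst possible moment, between the check and the enqueue. */
bool woken_with_lock_held(void) {
    model_t m = {0};

    m.locked = true;
    if (m.descriptor_free) {
        return true;       /* descriptor available, no wait */
    }
    dma_tx_complete(&m);   /* deferred: lock is held */
    enqueue_and_sleep(&m);
    return m.thread_woken && m.descriptor_free;
}
```

In this model woken_with_lock_held() returns true: the deferred ISR runs only after the thread is on the queue, so the wakeup is not lost.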

I have a band-aid to get around this issue (increasing the number of transmit descriptors and decreasing LWIP_SEND_TIMEOUT), but as far as I can tell, the change proposed above should fix the root cause. I could certainly be missing something here, though, or overlooking side effects of putting the whole block of code within a syslock.

Re: macWaitTransmitDescriptor issue

Posted: Thu Jan 23, 2020 8:58 pm
by Giovanni
Thanks, that code hasn't been modified for a while, I need to look at it carefully.


Re: macWaitTransmitDescriptor issue

Posted: Sun Apr 19, 2020 4:37 pm
by Giovanni

After a while...

Fixed bug as #1083.

There was also a potential similar issue in macWaitReceiveDescriptor(); I fixed both instances exactly as you suggested.