USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI Topic is solved

Report here problems in any of ChibiOS components. This forum is NOT for support.
apmorton
Posts: 36
Joined: Fri Sep 29, 2017 10:26 am
Been thanked: 16 times

USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI  Topic is solved

Postby apmorton » Mon Oct 16, 2017 11:39 pm

Hello All,

I have been facing a pretty obscure issue recently that has been absolutely impossible to find documentation or solutions on - figured I would post my findings here in case anyone else runs into this issue.
The company I work for recently reached out to STM directly and their engineers confirmed the issue, and pointed to an errata for the STM32L4, commenting that this errata in particular applies to all chips with the OTG_FS and OTG_HS cores - not just the L4 series.

In my case, the issue presented as an endpoint becoming "stuck" - without application intervention the core begins to send partial transfers whenever serviced by the host. These transfers are invalid, so the host will throw them out, but the transfer never "completes" as far as chibios is concerned, so your thread using usbTransmit blocks forever.

A solution for my case was to set STM32_USB_OTGFIFO_FILL_BASEPRI in mcuconf.h

Code: Select all

#define STM32_USB_OTGFIFO_FILL_BASEPRI       1


Errata Here, Section 2.9.4 "Data FIFO gets corrupted if the write sequence to the Transmit FIFO
is interleaved with other OTGFS register access".

Things to note about this errata, based on answers directly from STM engineers
  • This applies to all parts with OTG_FS or OTG_HS cores
  • This issue applies to both OTG_FS and OTG_HS cores, not just OTG_FS as mentioned in the errata
  • On chips with both the OTG_FS and OTG_HS, they are completely independent. Preempting OTG_FS TxFIFO filling to access registers on the OTG_HS will not trigger the issue

It is also important to note, I don't believe it is normally feasible to trigger this issue with ChibiOS unless you enable USB_USE_WAIT and have multiple threads blocking on usb transfers. If you have a single thread using blocking transfers, or use the non-blocking API's from a single thread you should not trigger this issue. However, for us this issue is very random and as such it is hard to really say for sure what set of conditions will or will not trigger it. It basically boils down to the usb_lld_pump thread being preempted while running the otg_fifo_write_from_buffer routine - and then something else in the system reading or writing any register that belongs to the same OTG core the lld pump thread was filling the TxFIFO of. Actually spotting a condition in your application which could trigger this is not simple. Proving your application could never trigger this condition is even more difficult. In our case, it depended heavily on the load of the system as a whole, the amount of usb traffic, and the direction of the wind on mars approximately 30 seconds in the future.

Having said that, I can't think of a scenario where you would use blocking API's for USB and not have multiple threads issuing usb transfers (assuming your device has multiple endpoints).

My understanding is the usb driver in ChibiOS is implemented this way for performance reasons, but I believe in this case due to the known errata of the STM OTG cores it makes the driver in its default configuration potentially unstable.

I propose the default behavior be changed to a more conservative, less potentially dangerous state. The TxFIFO should be filled while the system is in a locked state - effectively STM32_USB_OTGFIFO_FILL_BASEPRI defined to 1, rather than 0.

User avatar
Giovanni
Site Admin
Posts: 12943
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 729 times
Been thanked: 609 times
Contact:

Re: USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI

Postby Giovanni » Tue Oct 17, 2017 7:19 am

Hi,

Thanks for the report, I knew the errata, the driver *should* prevent access to other registers while a transfer is in progress but apparently this is not the case. I will look into this.

About doing the transfer in locked state, this would defeat the need for the pump thread, it exists in order to not do data copying with interrupts disabled. Early versions of the driver did exactly that, data copy was performed in ISR context.

Giovanni

apmorton
Posts: 36
Joined: Fri Sep 29, 2017 10:26 am
Been thanked: 16 times

Re: USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI

Postby apmorton » Tue Oct 17, 2017 9:51 pm

Hi Giovanni,

First, I wanted to start by thanking you for your hard work on this project - it is truly the best RTOS available for STM32. Many thanks :D

The driver in its current state prevents interrupts on the OTG core being filled from firing, which does generally prevent the majority of register access to the OTG core while filling the TxFIFO. However, with the default configuration in example mcuconf.h files the driver will not prevent other threads which may access OTG registers from preempting the pump thread while it is filling a TxFIFO. (due to the pump thread being lowest priority and STM32_USB_OTGFIFO_FILL_BASEPRI being zero).

I don't know the exact sequence of events that causes this to happen, but generally I have thought of it like the following. Although I have no concrete proof of this exact sequence of events, it does seem like the likelihood of a crash in our system depended on the amount of activity happening in unrelated peripherals (OTG_HS as a USB host in our case):

  • In the OTG ISR, an endpoint is serviced and has completed a transaction, _usb_isr_invoke_out_cb is called
  • The ISR returns, and the thread that issued the blocking USB transaction is scheduled, it then blocks on an unrelated part of the application
  • The the OTG ISR, an endpoint is serviced and requires TxFIFO filling, so it resumes the pump thread
  • The ISR returns, and the pump thread is scheduled
  • Partway through filling the TxFIFO another unrelated peripheral interrupts the pump thread
  • The ISR clears the condition that was blocking the original thread from above
  • The ISR returns, and due to the thread from above having a higher priority than the pump thread, it is scheduled
  • The thread attempts to setup another OUT transaction, and clobbers the TxFIFO in the process

Perhaps a solution could be to indicate the start of TxFIFO filling by setting a flag, and then ending TxFIFO by clearing the flag and broadcasting on an event source. Normal or S class USB API's could check this flag under lock and block on the event broadcast if TxFIFO filling was active. Use of I Class USB API's when the flag is true could either be considered a bug or not check at all. I am happy to assist in implementation or testing of any changes regarding this issue - it may be difficult for others to actually reproduce this issue reliably in the first place.

That, or simply move the TxFIFO filling back to ISR context. I guess it really depends on the performance impact.
I may be in the minority here, but the entire function of my application is to shovel data back and forth over USB, so the performance impact of USB on _other_ parts of the system is not a concern. However, the performance of the USB system itself is critical.

Austin

User avatar
Giovanni
Site Admin
Posts: 12943
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 729 times
Been thanked: 609 times
Contact:

Re: USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI

Postby Giovanni » Wed Oct 18, 2017 7:43 am

Hi,

I am considering to introduce a global mutex for OTG access from threads, including the pump thread, this should prevent any possible conflict scenario. I will give it a try after finishing the current task.

Should this not work or prove messy then the lesser evil is to return doing data copy in the ISR (and make the OTG IRQ have a very low priority so it can be preempted) by other peripherals.

Giovanni

User avatar
Giovanni
Site Admin
Posts: 12943
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 729 times
Been thanked: 609 times
Contact:

Re: USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI

Postby Giovanni » Tue Oct 31, 2017 2:35 pm

Hi,

Moving in "bug reports".

Giovanni

User avatar
Giovanni
Site Admin
Posts: 12943
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 729 times
Been thanked: 609 times
Contact:

Re: USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI

Postby Giovanni » Fri Jul 27, 2018 2:58 pm

Hi,

I committed a modified driver in trunk, it does transfers in ISR context, the pump thread has been removed. I choose this solution because simplicity and, looking ahead, implementation of DMA transfers.

Please test this solution, if the feedback is positive I will backport it to 17.6.4.

Giovanni

apmorton
Posts: 36
Joined: Fri Sep 29, 2017 10:26 am
Been thanked: 16 times

Re: USB_USE_WAIT, TxFIFO Errata and STM32_USB_OTGFIFO_FILL_BASEPRI

Postby apmorton » Wed Aug 15, 2018 7:32 pm

Hi Giovanni,

Sorry I never got back to you on this issue.
We had already gotten a resolution to our stability issue with the STM32_USB_OTGFIFO_FILL_BASEPRI hack and moved on.
We were on 16.x at the time and didn't have enough time to mess around getting the codebase working on trunk.

I have used 17.x and 18.x on some smaller USB based projects since then and not had any issues, although its worth noting that lack of a crash isn't proof of a fix since this issue is so random in nature.

The project in question has long since been completed (and 100% stable with the STM32_USB_OTGFIFO_FILL_BASEPRI fix), and I don't really have time to revisit it at this point.

Moving the fifo fills to the ISR is equivalent to STM32_USB_OTGFIFO_FILL_BASEPRI 1 if I am not mistaken, so the fix should work perfectly fine.

I don't know if backporting this kind of change is a good or bad idea. Its a pretty rare bug, and the change could have a pretty large impact on performance for both USB and other peripherals. If someone is hitting this bug in an older version of chibios they can work around it by setting STM32_USB_OTGFIFO_FILL_BASEPRI. If the fix is backported and it causes some significant performance change to an existing application, the user is kinda hosed. On the other hand, this is one of those bugs that you cannot find no matter how hard you look, so having it fixed for all users definitively is a good thing.

Also, considering this post has been around almost 10 months and nobody else has commented about the issue may indicate that its simply not an issue most/any other users run into.

Thanks for your work on this project as always.

Austin


Return to “Bug Reports”

Who is online

Users browsing this forum: No registered users and 5 guests