Optimization doubts with USB driver

Postby **Ceco** » Thu Dec 06, 2018 6:27 pm

Hello. I'm using an STM32F439 with an external USB3320C PHY. The OS version is 18.2.0. The driver is configured for HS mode. The CPU clock is 168 MHz.

I'm trying to implement MSD mode with maximum performance. After digging into the code, I found one potential performance limitation.

In the file hal_usb_lld.c there are functions to read and write the USB FIFO. These functions are implemented as simple loops, so they consume CPU time. There is little information about the dedicated DMA in HS mode, so I understand this implementation. But after some measurements, I was surprised by the big difference in CPU load between the two functions. Here are my measurements for a 512-byte transfer:

- without code optimization (-O0):
otg_fifo_write_from_buffer() -> takes 18us
otg_fifo_read_to_buffer() -> takes 110us

- with code optimization (-O2):
otg_fifo_write_from_buffer() -> takes 5us
otg_fifo_read_to_buffer() -> takes 40us

I'm using a GPIO and an oscilloscope for the measurements. The phenomenon has an explanation, of course: the FIFO read code is much more complicated than the FIFO write code.

In my opinion, some exceptional cases are handled inside the loop, which makes it slower. But for fast transfers these exceptional cases are not relevant most of the time. So I tried to optimize the otg_fifo_read_to_buffer() function a bit, as follows:

Code:

static void otg_fifo_read_to_buffer(volatile uint32_t *fifop,
                                    uint8_t *buf,
                                    size_t n,
                                    size_t max) {
   uint32_t w;
   size_t cnt = n;

   while(cnt)
   {
      w = *fifop;
      if(cnt <= 4)
      {
         /* Slower but it happens just once! */
         while(cnt)
         {
            /* Take the small parts... */
            *buf++ = (uint8_t)w;
            w >>= 8;
            cnt--;
         }
         if(n > max)
         {
            /* When this happens??? */
            n -= max;
            while(n)
            {
               w = *fifop;
               if(n <= 4)
                  break;
               else
                  n -= 4;
            }
         }
      }
      else
      {
         /* Most of the time we play here */
         *((uint32_t *)buf) = w;
         cnt -= 4;
         buf += 4;
      }
   }
}

With this code I measured the following:

- without code optimization (-O0):
otg_fifo_read_to_buffer() -> takes 24us

- with code optimization (-O2):
otg_fifo_read_to_buffer() -> takes 8us

That is about 5 times faster than the original read function (40us -> 8us at -O2).
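To put the timings in perspective, the sketch below converts a measured duration into CPU cycles per 32-bit FIFO word and into copy throughput. It uses only the figures quoted in this post (168 MHz clock, 512-byte transfer); the function and macro names are made up for illustration.

```c
#include <stdint.h>

/* Figures taken from the post: 168 MHz core clock, 512-byte transfers. */
#define CPU_HZ      168000000u
#define XFER_BYTES  512u

/* CPU cycles spent per 32-bit FIFO word for a duration measured in
   microseconds. The result is scaled by 10 so that one decimal digit
   survives without floating point (525 means 52.5 cycles per word). */
uint32_t cycles_per_word_x10(uint32_t duration_us) {
  uint32_t cycles = (CPU_HZ / 1000000u) * duration_us; /* 168 cycles per us */
  uint32_t words  = XFER_BYTES / 4u;                   /* 128 words         */
  return (cycles * 10u) / words;
}

/* Copy throughput in kB/s (1 kB = 1000 bytes) for the same measurement. */
uint32_t throughput_kBps(uint32_t duration_us) {
  return (XFER_BYTES * 1000u) / duration_us;
}
```

With the -O2 figures, 40us works out to about 52.5 cycles per word for the original read loop and 8us to about 10.5 cycles per word for the modified one, against roughly 6.5 cycles per word for the 5us write.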

My perspective is that of the MSD driver, which uses large blocks almost all the time. Such an implementation will probably be less useful for other cases.

The best solution is surely to use DMA, but documentation on the dedicated DMA in OTG_HS is lacking. I'm thinking about using a standard DMA in memory-to-memory mode, but I'm not sure whether it is possible or how much CPU time it would save.

I'm open to comments.

Postby **Giovanni** » Thu Dec 06, 2018 6:47 pm

Hi,

Moving this topic to "bug reports".

Giovanni

Postby **Ceco** » Fri Dec 07, 2018 7:24 am

Giovanni, at the moment it does not look like a bug; the original code works just fine. The idea is to achieve the highest performance, which normally is not an issue for most use cases.

Meanwhile, can you give me a bit more information about your code:

Code:

static void otg_fifo_read_to_buffer(volatile uint32_t *fifop,
                                    uint8_t *buf,
                                    size_t n,
                                    size_t max) {
  uint32_t w = 0;
  size_t i = 0;

  while (i < n) {
    if ((i & 3) == 0){
      w = *fifop;
    }
    if (i < max) {
      *buf++ = (uint8_t)w;
      w >>= 8;
    }
    i++;
  }
}

As I understand it, this code expects that n > max is possible when the function is called. When does this happen? To my mind, this is the situation where a given amount of data is available on the OUT endpoint (n), but we request fewer bytes (max). In this case we flush the RX FIFO and discard the excess data.

Is what I wrote correct?

I found one more interesting point. If the function otg_fifo_read_to_buffer() is NOT declared inline, the compiler does not inline it, even at -O2. This is strange, since the function looks like a perfect candidate for inlining, and I expected the compiler to inline it by itself. When I declared it inline manually, I gained a small improvement in execution time. It is really small, but the overall write performance on the SD card improved by 1-2%.

Postby **Giovanni** » Fri Dec 07, 2018 8:42 am

Hi,

Correct, the function has to empty the FIFO even if the requested size is less than the available data.
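The n > max behaviour can be checked on a host machine with a small simulation. This is only a sketch: the array walked by `fifo` is a hypothetical stand-in for the RX FIFO register, and `sim_read_to_buffer` is an illustrative name, not a driver function. The loop body is the same as in the driver function quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side copy of the driver loop. Each `*fifo++` plays the role of one
   `*fifop` register read. Returns the number of FIFO words consumed. */
size_t sim_read_to_buffer(const uint32_t *words, uint8_t *buf,
                          size_t n, size_t max) {
  const uint32_t *fifo = words;
  uint32_t w = 0;
  size_t i = 0;

  while (i < n) {
    if ((i & 3) == 0) {
      w = *fifo++;            /* one word popped for every 4 bytes of n   */
    }
    if (i < max) {
      *buf++ = (uint8_t)w;    /* only the first max bytes reach the buffer */
      w >>= 8;
    }
    i++;
  }
  return (size_t)(fifo - words);
}
```

With n = 10 and max = 5, three words are consumed (the whole packet) while only five bytes are stored; the remaining bytes are read from the FIFO and dropped.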

About the compiler, it does inline functions regardless of the "inline" specifier.

Giovanni

Postby **Ceco** » Fri Dec 07, 2018 1:22 pm

Giovanni wrote: Correct, the function has to empty the FIFO even if the requested size is less than the available data.

OK, but how can this happen? I mean, what is the use case? And why do we discard the data? Maybe it will be requested by the next function call. Sorry if the question is strange, but to me discarding data is something bad or, to put it mildly, exceptional.

Giovanni wrote: About the compiler, it does inline functions regardless of the "inline" specifier.

In this exact case, if there is NO inline specifier on the function declaration, the compiler does NOT inline the function even at the maximum optimization level. If the function is declared inline, the compiler inlines it at the maximum optimization level.

I also expected different behavior from the compiler, but this is what I observe.
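If the aim is to stop depending on the optimizer's heuristics, GCC and Clang offer the `always_inline` function attribute. A minimal sketch, assuming a GCC-compatible toolchain; `FORCE_INLINE`, `store_word_bytes` and `unpack_words` are hypothetical names, not part of the driver:

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang-specific: always_inline asks for inlining regardless of the
   optimizer's cost model, while plain `inline` is only a hint. Other
   compilers fall back to the plain hint here. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline
#endif

/* Hypothetical helper: store up to 4 bytes of one FIFO word, LSB first. */
FORCE_INLINE void store_word_bytes(uint8_t *buf, uint32_t w, size_t count) {
  while (count--) {
    *buf++ = (uint8_t)w;
    w >>= 8;
  }
}

/* Caller: the helper body is expanded in place inside this loop. */
void unpack_words(const uint32_t *words, size_t nbytes, uint8_t *buf) {
  size_t i;

  for (i = 0; i < nbytes; i += 4) {
    size_t chunk = (nbytes - i < 4) ? (nbytes - i) : 4;
    store_word_bytes(buf + i, words[i / 4], chunk);
  }
}
```

GCC's `-Winline` option can additionally warn when a function declared inline could not be inlined, which may help to confirm what the compiler actually did.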

Postby **Giovanni** » Fri Dec 07, 2018 1:47 pm

It can happen if the host sends more data than the device is waiting for.

Giovanni

Postby **Ceco** » Fri Dec 07, 2018 2:27 pm

OK, I will leave it as it is. In my implementation it is handled as an exceptional case at the end of the loop, so it does not affect the performance of the loop at all.
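One way to keep the n > max handling out of the hot loop is sketched below as a host-side simulation. It is not the code from the first post: as far as can be told from reading that version, it copies all n bytes into the buffer and then reads extra words when n > max, so this sketch clamps the copy to min(n, max) first and drains whatever is left. The array walked by `fifo` stands in for the `*fifop` register, `sim_fast_read` is a hypothetical name, and the word-wise store assumes a little-endian CPU (as on Cortex-M).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the number of FIFO words consumed, always ceil(n / 4). */
size_t sim_fast_read(const uint32_t *words, uint8_t *buf,
                     size_t n, size_t max) {
  const uint32_t *fifo = words;
  size_t total_words = (n + 3) / 4;      /* words held by the FIFO        */
  size_t store = (n < max) ? n : max;    /* bytes that fit in the buffer  */
  size_t full = store / 4;               /* whole words for the fast path */
  size_t tail = store % 4;               /* leftover bytes, 0..3          */
  uint32_t w;

  /* Fast path: one word per iteration, no per-byte tests.
     memcpy keeps the store valid for unaligned buffers. */
  while (full--) {
    w = *fifo++;
    memcpy(buf, &w, sizeof w);
    buf += 4;
  }

  /* Tail: runs at most once per call. */
  if (tail) {
    w = *fifo++;
    while (tail--) {
      *buf++ = (uint8_t)w;
      w >>= 8;
    }
  }

  /* Exceptional case: discard the words that do not fit (n > max). */
  while ((size_t)(fifo - words) < total_words) {
    (void)*fifo++;
  }
  return (size_t)(fifo - words);
}
```

In a real driver the reads would go through the volatile FIFO pointer, so the discarding reads in the last loop could not be optimized away.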

With this small correction the write speed of the MSD driver doubled, so I'm pretty happy with the results. I will post a report in the Mass Storage thread, just for information.

For the moment I will not deal with DMA. The dedicated DMA is not well documented, and a standard DMA channel in memory-to-memory mode is not easy to implement. Both the read and write FIFO functions run in interrupt context, so we would normally have to use DMA in polling mode, which is not ideal. For a real performance gain I would need to redesign a big part of the USB driver, and I don't feel I have enough knowledge for such a task.

Thanks for the support, Giovanni. I hope this thread will be useful for future improvements of the USB driver, too.

Postby **Giovanni** » Fri Dec 07, 2018 2:31 pm

DMA support will be introduced, it is already planned.

It is missing in the current implementation because there are devices with asymmetric peripherals, FS without DMA and HS with DMA.

Giovanni

Postby **Ceco** » Fri Dec 07, 2018 2:41 pm

Super.

Postby **Giovanni** » Mon Dec 31, 2018 9:29 am

Hi,

Do you have a final recommended optimization for this? I am trying to close as many tickets as possible before the next release.

Giovanni
