I'm trying to implement MSD mode with maximum performance. After digging into the code, I found one potential performance limitation.
In the file hal_usb_lld.c there are functions to read and write the USB FIFO. These functions are implemented as simple loops, so they consume CPU time. There is little information available about the dedicated DMA in HS mode, so I understand this implementation. But after some measurements, I was surprised by the big difference in CPU load between the two functions. Here are my measurements for a 512-byte transfer:
- without code optimization (-O0):
otg_fifo_write_from_buffer() -> takes 18us
otg_fifo_read_to_buffer() -> takes 110us
- with code optimization (-O2):
otg_fifo_write_from_buffer() -> takes 5us
otg_fifo_read_to_buffer() -> takes 40us
I'm using a GPIO and an oscilloscope for the measurements. The difference has an explanation, of course: the FIFO read code is much more complicated than the FIFO write code.
In my opinion, some exceptional cases are handled inside the loop, which makes it slower. But for fast transfers these exceptional cases are irrelevant most of the time, so I tried to optimize the otg_fifo_read_to_buffer() function a bit, as follows:
static void otg_fifo_read_to_buffer(volatile uint32_t *fifop,
                                    uint8_t *buf,
                                    size_t n,
                                    size_t max) {
  uint32_t w;
  size_t cnt = n;

  while (cnt) {
    w = *fifop;
    if (cnt <= 4) {
      /* Slower but it happens just once! */
      while (cnt) {
        /* Take the small parts... */
        *buf++ = (uint8_t)w;
        w >>= 8;
        cnt--;
      }
      if (n > max) {
        /* When this happens??? */
        n -= max;
        while (n) {
          w = *fifop;
          if (n <= 4)
            break;
          else
            n -= 4;
        }
      }
    }
    else {
      /* Most of the time we play here */
      *((uint32_t *)buf) = w;
      cnt -= 4;
      buf += 4;
    }
  }
}
With this code I measured the following:
- without code optimization (-O0):
otg_fifo_read_to_buffer() -> takes 24us
- with code optimization (-O2):
otg_fifo_read_to_buffer() -> takes 8us
That is roughly 4.6 times faster at -O0 (110us -> 24us) and 5 times faster at -O2 (40us -> 8us).
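To put these timings in perspective, they can be converted into an effective copy throughput. The helper below is only arithmetic on the measurements quoted in this post, and its name is made up for illustration. It relies on the fact that one byte per microsecond equals one megabyte (10^6 bytes) per second.

```c
/* Hypothetical helper: converts a measured copy time into throughput.
 * bytes / microseconds == megabytes (10^6 bytes) per second. */
static double throughput_mb_per_s(double bytes, double micros) {
  return bytes / micros;
}
```

For the -O2 numbers, throughput_mb_per_s(512, 40) gives 12.8 MB/s for the original read function and throughput_mb_per_s(512, 8) gives 64 MB/s for the modified one. For comparison, the raw high-speed USB bit rate of 480 Mbit/s corresponds to 60 MB/s, so the original read loop is well below the bus rate while the modified one is in the same range.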
My point of view is that of the MSD driver, which uses large blocks almost all the time. Such an implementation will probably not be useful for other cases.
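For those other cases, here is a minimal sketch of a variant that still copies whole words in the fast path but never writes more than max bytes. It assumes that n is the number of bytes waiting in the FIFO for this packet and max is the room left in the buffer, so any excess has to be popped and discarded. The function name is hypothetical, this is not the ChibiOS implementation, and I have not timed it. It also assumes a little-endian core (as on Cortex-M), so that memcpy() of a word matches the byte-by-byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant, not the ChibiOS code.
 * Assumption: n = bytes in the FIFO, max = space available in buf. */
static void fifo_read_bounded(volatile uint32_t *fifop, uint8_t *buf,
                              size_t n, size_t max) {
  size_t copy  = (n < max) ? n : max;  /* bytes that fit into buf       */
  size_t words = (n + 3) / 4;          /* FIFO words that must be read  */
  size_t full  = copy / 4;             /* words that are stored whole   */
  size_t i;

  /* Fast path: one FIFO read and one word store per iteration.
   * memcpy() avoids alignment and aliasing problems with buf. */
  for (i = 0; i < full; i++) {
    uint32_t w = *fifop;
    memcpy(buf, &w, sizeof w);
    buf += sizeof w;
  }

  /* Tail: at most one partially used word, copied byte by byte. */
  copy -= full * 4;
  if (copy > 0) {
    uint32_t w = *fifop;
    i++;
    while (copy--) {
      *buf++ = (uint8_t)w;
      w >>= 8;
    }
  }

  /* Excess data (n > max): pop the remaining words and drop them. */
  for (; i < words; i++) {
    (void)*fifop;
  }
}
```

With n = 512 and max = 512 only the first loop runs, which is the MSD case. The two extra paths each run at most once per call, so they should cost very little for large blocks.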
Surely the best option is to use DMA, but there is a lack of documentation regarding the dedicated DMA in OTG_HS. I'm thinking about using the standard DMA in memory-to-memory mode... but I'm not sure whether it is possible or how much CPU time we would gain.
I'm open to comments.