STM32 MAC driver optimization

Giovanni
Site Admin
Posts: 13012
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 744 times
Been thanked: 620 times

Re: STM32 MAC driver optimization

Postby Giovanni » Thu Jan 17, 2013 8:39 pm

Hi,

The clean approach is to fix lwIP. You should probably contact one of the developers and explain the problem; I know one of them, but I haven't heard from him for a while now. The best approach is to let them analyze the problem and decide whether to make corrections.

In my opinion pbuf handling should be improved to support external buffers and a finalization callback.

Giovanni

jcbarlow
Posts: 21
Joined: Sun Jul 03, 2011 6:48 am
Location: Bend, Oregon, USA

Re: STM32 MAC driver optimization

Postby jcbarlow » Fri Jan 18, 2013 12:39 am

You might want to look at http://savannah.nongnu.org/task/?7896 and http://savannah.nongnu.org/patch/?7658
There seems to be some interest in that sort of tinkering.

iggarpe
Posts: 129
Joined: Sun Sep 30, 2012 8:32 pm

Re: STM32 MAC driver optimization

Postby iggarpe » Fri Jan 18, 2013 2:14 am

rubenswerk wrote:Hello,
One hint to iggarpe:
I also did some testing with massive ping flooding (no pause between packets, maximum packet size). I found out that this situation leads to a memory leak in lwip.

viewtopic.php?f=3&t=23&start=180#p3888

Maybe this is relevant for you, because you're talking about packet bursts in your use case.


Thanks for the hint. I gave it a try and I can definitely confirm that there is some sort of memory leak somewhere (in lwIP or lwipthread, most likely the former). I'm using the latest lwIP, 1.4.1.

BUT it only affects ICMP, not UDP. This is my setup and test method:

1- I have two servers: FTP (netconn API) and a simple UDP receiver that just checks a sequence number and reports when packets are lost. I needed the latter to find out how many UDP packets can be handled during a flood before they start to be dropped.

2- I start a large FTP transfer.

3- I send a repetitive, huge flood of UDP packets.

The above works perfectly both with ChibiOS trunk code and with my modified multibuffer zero-copy reception (which dramatically increases my ability to handle the flood). I mean FTP works perfectly; obviously UDP packets get dropped, that's expected. The point here is that the IP stack remains sane. The FTP transfer completes (though it is slower, as expected) and further FTP connections and transfers are possible and work fine.
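
For illustration, the loss check done by such a UDP receiver could look roughly like the sketch below. It is not the code used in the test above: it assumes each datagram carries a 32-bit sequence number that increases by one per packet, and the names `seq_tracker` and `seq_tracker_feed` are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker state; not taken from the original test code. */
struct seq_tracker {
  bool     started;   /* false until the first packet has been seen */
  uint32_t expected;  /* sequence number the next packet should carry */
  uint32_t lost;      /* running total of packets counted as lost */
};

/* Feeds one received sequence number and returns how many packets were
   skipped since the previous one. Late or duplicate packets are ignored. */
static uint32_t seq_tracker_feed(struct seq_tracker *t, uint32_t seq) {
  uint32_t gap = 0;

  if (t->started) {
    gap = seq - t->expected;       /* unsigned math copes with wraparound */
    if (gap >= 0x80000000u) {
      return 0;                    /* older than expected: stale or duplicate */
    }
  }
  t->started  = true;
  t->expected = seq + 1u;
  t->lost    += gap;
  return gap;
}
```

The receive loop would call `seq_tracker_feed()` once per datagram and report whenever the return value is non-zero, which is enough to see at what flood rate drops begin.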

If instead of the UDP flood I send a ping flood with the following Linux command (more convenient for me than hrPing):

Code: Select all

sudo ping -c 100000 -f -l 100000 192.168.0.20


Then I can no longer open FTP data connections for directory listing or file transfer. I'm still investigating but it seems pretty clear that some of the pools are empty. I'll keep this thread posted with any progress.

EDIT: I've enabled the lwIP stats display, and it seems that after all I'm doing something wrong with the netconn API; neither the ping flood nor the UDP flood has adverse effects. I'll fix my netconn usage tomorrow and edit again to confirm.
Last edited by iggarpe on Fri Jan 18, 2013 4:33 am, edited 1 time in total.

iggarpe

Re: STM32 MAC driver optimization

Postby iggarpe » Fri Jan 18, 2013 2:31 am

Giovanni wrote:The clean approach is to fix lwIP. You should probably contact one of the developers and explain the problem; I know one of them, but I haven't heard from him for a while now. The best approach is to let them analyze the problem and decide whether to make corrections.

In my opinion pbuf handling should be improved to support external buffers and a finalization callback.


Sure, but I can't wait for that to happen. I'll contact them though and see if I can push that feature.

FYI, I've got zero-copy multibuffer working and it seems rock solid after quite a lot of testing. The performance increase is dramatic (memory-wise). I went with a simplified version of option (b) in my previous message:

1- Minimal and pretty much trivial patch to lwIP. See attachment.

2- Added the following to mac.h:

Code: Select all

/**
 * @brief   Reserves some memory before each receive buffer.
 * @details In zero copy mode this is useful to place structures needed by
 *          the IP code to handle a chain of buffers.
 */
#if !defined(MAC_RXBUF_PAD_SIZE) || defined(__DOXYGEN__)
#define MAC_RXBUF_PAD_SIZE          0
#endif


When a MAC driver supports zero copy it must reserve MAC_RXBUF_PAD_SIZE unused bytes before each receive buffer, and that must be enough space for the struct pbuf plus ETH_PAD_SIZE. Among other code, I added this to lwipthread.c:

Code: Select all

/* Minimum required receive buffer front padding.*/
#define MIN_MAC_RXBUF_PAD_SIZE \
  (LWIP_MEM_ALIGN_SIZE(sizeof(struct pbuf)) + ETH_PAD_SIZE)


Code: Select all

#if MAC_USE_ZERO_COPY
  /* Make sure receive buffer padding is enough.*/
  chDbgAssert(MAC_RXBUF_PAD_SIZE >= MIN_MAC_RXBUF_PAD_SIZE,
              "lwip_thread() #1",
              "MAC_RXBUF_PAD_SIZE too small");
#endif


The result is small and simple, and IMHO it can't be cleaner than this (in other words, this is as clean as it can get without making significant changes to lwIP that would require knowledge I don't have and testing time I can't afford).
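
To make the buffer layout concrete, here is a minimal sketch of the front-padding idea. It is not the attached patch and does not use real lwIP or ChibiOS types: `struct demo_pbuf` is a cut-down stand-in for lwIP's `struct pbuf`, `DEMO_ETH_PAD` plays the role of ETH_PAD_SIZE, `DEMO_RXBUF_PAD` plays the role of MAC_RXBUF_PAD_SIZE, and a 4-byte memory alignment is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ALIGNMENT      4u     /* assumed memory alignment */
#define DEMO_ALIGN_SIZE(n)  (((n) + DEMO_ALIGNMENT - 1u) & ~(size_t)(DEMO_ALIGNMENT - 1u))
#define DEMO_ETH_PAD        2u     /* stand-in for ETH_PAD_SIZE */
#define DEMO_FRAME_MAX      1522u  /* largest frame the DMA may store */

/* Cut-down stand-in for lwIP's struct pbuf (the real one has more fields). */
struct demo_pbuf {
  struct demo_pbuf *next;
  void             *payload;
  uint16_t          tot_len;
  uint16_t          len;
};

#define DEMO_HDR_BYTES  DEMO_ALIGN_SIZE(sizeof(struct demo_pbuf))
#define DEMO_RXBUF_PAD  (DEMO_HDR_BYTES + DEMO_ETH_PAD)

/* One receive buffer: front padding followed by the frame area. The union
   only guarantees that the start of raw[] is aligned for the header. */
union demo_rxbuf {
  struct demo_pbuf hdr;
  uint8_t          raw[DEMO_RXBUF_PAD + DEMO_FRAME_MAX];
};

/* Compile-time counterpart of the run-time padding assertion. */
_Static_assert(DEMO_RXBUF_PAD >= DEMO_HDR_BYTES + DEMO_ETH_PAD,
               "receive buffer front padding too small");

/* Given the frame pointer handed out by the driver, builds the header in
   the padding that precedes it. No separate header pool and no copy. */
static struct demo_pbuf *demo_wrap_frame(uint8_t *frame, uint16_t len) {
  union demo_rxbuf *buf = (union demo_rxbuf *)(void *)(frame - DEMO_RXBUF_PAD);
  struct demo_pbuf *p = &buf->hdr;

  p->next    = NULL;
  p->payload = frame - DEMO_ETH_PAD;            /* pad bytes count as payload */
  p->len     = (uint16_t)(len + DEMO_ETH_PAD);
  p->tot_len = p->len;
  return p;
}
```

Because the header sits directly in front of the payload, the result has the same shape as a contiguous pbuf, which is the property the approach depends on.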

I will submit the full patch tomorrow after some more testing.

Cheers.
Attachments
lwip-1.4.1-zero-copy.zip

iggarpe

Re: STM32 MAC driver optimization

Postby iggarpe » Fri Jan 18, 2013 12:03 pm

Giovanni,

After sleeping on it, I honestly think that front padding of the MAC low level driver receive buffers is, as of now, by far the BEST option to implement zero-copy reception.

Rationale:

1- It is dead easy to implement receive buffer front padding in a low level driver which is already capable of zero-copy.

2- Allows use of ETH_PAD_SIZE, which would be impossible otherwise (see note 1).

3- Makes the lwipthread zero-copy code SUPER simple:

3.1- You do not need a separate pool for the struct pbuf; you just need to initialize it in the padding area of the buffer that mac_lld_get_next_receive_buffer returns.

3.2- You do not need to worry about the separate pool having the same number of elements as MAC low level driver buffers (less will result in dropped packets, more will waste memory).

4- Requires a trivial, minimal patch to lwIP (see note 2). This is good because (a) one can be pretty much sure it does not break anything and (b) the patch is dead easy to maintain and port to future lwIP versions (see note 3).



Note 1: remember that the buffers are owned by the low level driver, there is no way you can place additional ETH_PAD_SIZE space in front of it because the buffer is given to you already filled with received data.

Note 2: the patch is trivial because the mechanism allows receive pbufs to work EXACTLY like normal PBUF_RAM/PBUF_POOL pbufs, where the struct pbuf is contiguous with the actual buffer. Separating them (as is done in PBUF_ROM/PBUF_REF) would require a more elaborate patch and an INSANE amount of testing to make sure no part of lwIP is broken because it makes idiotic assumptions.

Note 3: we could sit and wait forever for lwIP developers to implement and test the required pbuf management changes. From what I have read, users have been requesting exactly that since at least 2003 (see http://lists.gnu.org/archive/html/lwip-users/2003-03/msg00085.html).



BTW, I thought that now that I'm deep into this I may as well implement zero-copy transmission. However, I think it's not going to be very useful without huge, deep and impractical changes in lwIP, because of the usual special requirements of MAC peripherals regarding DMA buffers (alignment, size, memory areas, etc). I believe the STM32 MAC is an exception regarding size and alignment, but it can still read only from SRAM, and not from all SRAM areas. I could make the driver "reject" a buffer located in ROM and force lwipthread to make a copy of the pbuf in SRAM, or have an internal scratch area and do the copying internally, but the code would be very complex.

That said, I think there is some room for easy improvement of the transmit path: instead of N transmit buffers of 1522 bytes each, allocate a single monolithic buffer and use it as a ring buffer. That way, if frames are small you can queue many more of them and return immediately.
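
A rough sketch of that transmit ring idea follows. It is only an illustration under assumptions, not driver code: the names (`tx_ring`, `tx_ring_alloc`, `tx_ring_release`) and the 4096-byte size are invented, each frame is kept contiguous on the assumption that the DMA wants a linear buffer, and frames are released strictly in FIFO order with the length supplied by the caller (in a real driver it would presumably come from the DMA descriptor).

```c
#include <stddef.h>
#include <stdint.h>

#define TX_RING_SIZE 4096u   /* hypothetical total size of the transmit area */

struct tx_ring {
  uint8_t mem[TX_RING_SIZE];
  size_t  head;    /* offset where the next frame will be placed */
  size_t  tail;    /* offset of the oldest frame still queued */
  size_t  wrap;    /* offset at which queued data jumps back to 0 */
  size_t  frames;  /* number of frames currently queued */
};

static void tx_ring_init(struct tx_ring *r) {
  r->head = r->tail = 0;
  r->wrap = TX_RING_SIZE;
  r->frames = 0;
}

/* Reserves len contiguous bytes, or returns NULL if there is no room. */
static uint8_t *tx_ring_alloc(struct tx_ring *r, size_t len) {
  if (len == 0 || len > TX_RING_SIZE) return NULL;
  if (r->frames == 0) tx_ring_init(r);       /* empty: restart at offset 0 */

  if (r->head >= r->tail) {
    if (TX_RING_SIZE - r->head < len) {      /* does not fit at the end */
      if (r->tail <= len) return NULL;       /* nor at the front */
      r->wrap = r->head;                     /* leave the end unused */
      r->head = 0;
    }
  } else if (r->tail - r->head <= len) {
    return NULL;                             /* would run into the tail */
  }

  uint8_t *p = &r->mem[r->head];
  r->head += len;
  r->frames++;
  return p;
}

/* Releases the oldest frame; len must match the length it was allocated with. */
static void tx_ring_release(struct tx_ring *r, size_t len) {
  r->tail += len;
  r->frames--;
  if (r->tail == r->wrap) {                  /* reached the wrap point */
    r->tail = 0;
    r->wrap = TX_RING_SIZE;
  }
}
```

With 64-byte frames this area holds dozens of pending frames, where a pair of fixed 1522-byte buffers would hold two. The price is the unused space left at the end whenever a frame has to wrap to the front.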

mabl
Posts: 417
Joined: Tue Dec 21, 2010 10:19 am
Location: Karlsruhe, Germany

Re: STM32 MAC driver optimization

Postby mabl » Fri Jan 18, 2013 12:11 pm

Hi iggarpe,

iggarpe wrote:As soon as I reduced the amount of buffers to a more reasonable number, TCP and ICMP stopped working. From what I have read some code reuses the incoming pbufs to send a reply (at least ICMP does for sure, I suppose TCP does too). So my UDP services work fine but TCP and ICMP are broken.

So, in light of this new information, looks like zero copy reception is just NOT POSSIBLE in lwIP as is (i.e. using pbuf_custom functionality).

Is this problem solved now?

Also, I think the STM32 can timestamp packets. Could this padding area also be used to store the timestamp?

Sorry for the unqualified questions, I'm sadly too busy to follow the forum in full detail these days :-/

iggarpe

Re: STM32 MAC driver optimization

Postby iggarpe » Fri Jan 18, 2013 1:44 pm

mabl wrote:Hi iggarpe,

iggarpe wrote:As soon as I reduced the amount of buffers to a more reasonable number, TCP and ICMP stopped working. From what I have read some code reuses the incoming pbufs to send a reply (at least ICMP does for sure, I suppose TCP does too). So my UDP services work fine but TCP and ICMP are broken.

So, in light of this new information, looks like zero copy reception is just NOT POSSIBLE in lwIP as is (i.e. using pbuf_custom functionality).

Is this problem solved now?


Yes. The problem is that lwIP code makes assumptions about the nature of the input pbufs. So far I've identified one place, pbuf_header in pbuf.c, but it seems lwIP may be making these assumptions in a lot of other places, and I'm not up to the task of fixing that. More precisely, I'm not up to the task of thoroughly testing the fix, and I won't use such code for a critical part of my application. So the solution has been to use a receive pbuf structure that mimics pbufs of type PBUF_RAM/PBUF_POOL, which seem to be the ones lwIP assumes the low level input layer will provide, so I'm sure I'm not breaking anything.

Just for the record, the problem in pbuf_header is as follows:

- With a negative argument, pbuf_header is used to move the payload pointer to eat up headers as the packet travels upwards through the different layers on the INPUT path.

- With a positive argument, pbuf_header is used to make space in the pbuf for headers as the packet travels downwards through the different layers on the OUTPUT path.

- Positive arguments cannot be used with pbufs of type PBUF_REF/PBUF_ROM, because one does not know what lies before the buffer. In theory a pbuf of type PBUF_REF could be used if the payload pointer had previously been advanced by calling pbuf_header with a negative argument. However, that earlier call is not recorded, and the original payload pointer at the time the PBUF_REF was created is not stored anywhere, so pbuf_header has no way to know whether it is safe to make more space by subtracting a value from the payload pointer.

- Positive arguments can be used with pbufs of type PBUF_RAM/PBUF_POOL, but only if enough space was reserved during creation or pbuf_header has previously been called with negative arguments. How is this checked? By assuming the struct pbuf and the buffer are contiguous, and making sure the payload pointer does not go below the struct pbuf pointer plus the size of the struct pbuf. So for this check to work, the struct pbuf must be contiguous with the buffer. You can't even use a custom pbuf with extra fields at the end, because the check uses sizeof(struct pbuf).

- So what's the problem? We are talking about a receive pbuf, so if pbuf_header is called at all it will always be with a negative argument to remove headers, right? WRONG. lwIP reuses pbufs for transmission, and existing applications can do it too. For example, the ICMP code will use the receive pbuf chain of a PING to build and send the reply. This is very efficient and blah blah blah, but then you have a pbuf traveling down the output path, and you need pbuf_header to work on that pbuf with positive arguments.
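
The behaviour described in the points above can be modelled with a few lines of C. This is a simplified model written for this explanation, not lwIP source: `struct mini_pbuf`, `mini_header()` and the two-value type enum are stand-ins, and only the payload/length bookkeeping is reproduced.

```c
#include <stddef.h>
#include <stdint.h>

/* MINI_RAM models PBUF_RAM/PBUF_POOL, MINI_REF models PBUF_REF/PBUF_ROM. */
enum mini_type { MINI_RAM, MINI_REF };

struct mini_pbuf {
  uint8_t       *payload;
  uint16_t       len;
  enum mini_type type;
};

/* A contiguous block: header immediately followed by its buffer. */
struct mini_block {
  struct mini_pbuf hdr;
  uint8_t          data[64];
};

/* Moves the payload pointer by inc bytes: negative strips a header,
   positive makes room for one. Returns 0 on success, 1 if refused. */
static int mini_header(struct mini_pbuf *p, int16_t inc) {
  if (inc < 0) {
    if ((uint16_t)(-inc) > p->len) return 1;  /* cannot strip more than len */
  } else if (inc > 0) {
    if (p->type == MINI_REF) return 1;        /* unknown bytes before payload */
    /* Contiguity assumption: the buffer starts right after the struct, so
       the payload pointer must never drop below that address. */
    uintptr_t floor = (uintptr_t)p + sizeof(struct mini_pbuf);
    if ((uintptr_t)p->payload - (uintptr_t)inc < floor) return 1;
  }
  p->payload -= inc;
  p->len = (uint16_t)(p->len + inc);
  return 0;
}
```

In this model a contiguous pbuf whose 14-byte Ethernet header was stripped on input can get those 14 bytes back on output, which is what a reply built from the received pbuf needs. A reference-type pbuf in the same situation is refused.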

Also, I think the STM32 can timestamp packets. Could this padding area also be used to store the timestamp?


No, not really. Plus making use of the timestamping feature would require an extensive rewrite of the MAC low level driver.

Giovanni

Re: STM32 MAC driver optimization

Postby Giovanni » Fri Jan 18, 2013 3:04 pm

I still think that we should inform the lwIP group about our findings; I imagine they are interested in improvements as well.

Giovanni

