STM32 MAC driver optimization

iggarpe
Posts: 129
Joined: Sun Sep 30, 2012 8:32 pm


Postby iggarpe » Mon Dec 24, 2012 7:27 pm

Hi all,

I'm building an application with the fairly restrictive requirement of coping with network stall conditions that result in burst packet arrival (in normal conditions packets arrive one every 20ms). In this scenario available RAM is the most restrictive factor, and as of now the ChibiOS STM32 MAC driver implementation does not cut it.

Problem has been commented in a thread in the support board, but I'll summarize it here:

- mac_lld buffering is wasteful: 1522 bytes (actually 1524 because of round up to alignment) per packet. If packets are small (as they are in my case) a lot of RAM is wasted.

- lwipthread copies data from mac_lld buffers to lwIP pbufs allocated from lwIP's own pool. Packets are either in mac_lld buffers or in lwIP pbufs, but never in both. In order to cope with a burst of, say, 100 packets, I must allocate 100x1522 bytes in mac_lld AND 100 pbufs in lwIP. Not enough SRAM.
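The RAM cost described in the two points above can be estimated with a little arithmetic. The following is a minimal sketch, assuming a 200-byte packet (a hypothetical figure, since the post only says packets are "small") and ignoring DMA descriptor overhead and the lwIP pbuf pool. The helper names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fixed-size DMA buffers one frame occupies (ceiling division). */
static size_t buffers_per_frame(size_t frame_len, size_t buf_size) {
  assert(buf_size > 0);
  return (frame_len + buf_size - 1) / buf_size;
}

/* Driver RAM needed to hold 'frames' frames of 'frame_len' bytes each,
   when every buffer is 'buf_size' bytes regardless of how full it is. */
static size_t driver_ram(size_t frames, size_t frame_len, size_t buf_size) {
  return frames * buffers_per_frame(frame_len, buf_size) * buf_size;
}
```

Under these assumptions, a burst of 100 frames of 200 bytes needs 152400 bytes with 1524-byte buffers and 25600 bytes with 256-byte buffers. A full-size 1522-byte frame would span 6 buffers of 256 bytes.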

I've studied the STM32 MAC peripheral and after giving it some thought I've come up with a solution that involves:

1- Reduce the mac_lld buffer size to, say, 256 bytes. The STM32 MAC is capable of storing a frame in multiple DMA descriptors, and with a granularity of 256 bytes (instead of 1522) much less memory would be wasted.

2- Implement zero copy on reception (see note at the end on transmission).

While it's quite clear to me how to implement this, I'm not so sure how to do it in the least disruptive way, so that it can be easily integrated in ChibiOS without breaking other MAC drivers. I'm therefore asking for your advice and comments.

(I think (2) can be implemented without touching the mac_lld receive API)

Currently the mac_lld API for reception comprises three functions:

(a) mac_lld_get_receive_descriptor
(b) mac_lld_read_receive_descriptor
(c) mac_lld_release_receive_descriptor

The receive descriptor is an opaque structure belonging to mac_lld. (a) finds a received packet, (b) reads from it in several steps (necessary because data is to be placed in a pbuf chain of probably several pbufs) and (c) releases all buffers used for this frame so they're reverted to DMA for storage of new incoming packets.

As I see it, the only way to implement zero copy is, instead of asking mac_lld to read into a provided buffer (mac_lld_read_receive_descriptor), to make repeated calls to a mac_lld function that returns chunks of memory as pointer/size pairs until it reports that there are no more chunks for this packet. lwipthread would allocate PBUF_ROM (or PBUF_REF, not sure) pbufs instead of PBUF_POOL ones and fill them in with the retrieved pointer/size pairs. Finally, lwipthread WOULD NOT call mac_lld_release_receive_descriptor but would instead set up a callback mechanism so that when the pbufs are freed later, we tell mac_lld to reuse those buffers.
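The release-by-callback idea can be sketched in isolation. This is a rough illustration only: `ref_pbuf_t`, `rx_buffer_t`, `wrap_chunk` and `free_chain` are hypothetical names invented here and are not the lwIP `struct pbuf` or any ChibiOS type. The `owned_by_dma` flag stands in for whatever the driver does to hand a buffer back to the DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-side buffer; 'owned_by_dma' stands in for the OWN bit. */
typedef struct {
  uint8_t data[256];
  int     owned_by_dma;
} rx_buffer_t;

/* Hypothetical reference pbuf with a free callback (not the lwIP struct). */
typedef struct ref_pbuf {
  struct ref_pbuf *next;
  const uint8_t   *payload;
  size_t           len;
  void           (*on_free)(void *ctx);
  void            *ctx;
} ref_pbuf_t;

/* Callback: hands one buffer back to the DMA engine. */
static void give_back_to_dma(void *ctx) {
  ((rx_buffer_t *)ctx)->owned_by_dma = 1;
}

/* Points a pbuf at a driver buffer instead of copying the data. */
static void wrap_chunk(ref_pbuf_t *p, rx_buffer_t *b, size_t len) {
  assert(len <= sizeof b->data);
  p->next    = NULL;
  p->payload = b->data;
  p->len     = len;
  p->on_free = give_back_to_dma;
  p->ctx     = b;
}

/* Frees a chain; each element releases its own buffer independently. */
static void free_chain(ref_pbuf_t *p) {
  while (p != NULL) {
    ref_pbuf_t *next = p->next;
    p->on_free(p->ctx);
    p = next;
  }
}
```

The point of the design is that the stack never learns about descriptors: each chunk carries enough context to return its own buffer, which is why the release works per buffer rather than per frame.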

lwIP already supports "custom" pbufs that include a deallocation callback, though they are only enabled when IP_FRAG is. A tiny patch would allow enabling them independently of IP_FRAG.

Also, lwipthread would need some rewriting, because I understand that some architectures may not be able to provide a pointer/size pair, and the mac_lld_read_receive_descriptor may be more appropriate in their case.

Finally, note that there is an important distinction between the current and the proposed buffer release mechanism. The current one works on the descriptor (which mac_lld internally interprets to release one or many buffers, whatever it represents), but the proposed one would work at buffer level, because once each memory buffer goes into a pbuf, the buffers become basically independent as far as the lwIP memory management is concerned (though we know they're not, because they all belong to the same packet pbuf chain).

I find it quite difficult to reconcile both mechanisms with a single API. Also (correct me if I'm wrong), imposing a MAC API that requires mac_lld to provide pointer/size pairs may mean some drivers would need quite a bit of work, particularly those not supporting DMA: the driver would need to manage buffer allocation and copying itself.

Maybe we could use #defines to choose between a DMA zero-copy oriented API and the current one.

Thoughts?


NOTE: I have less of a problem with transmission, where there is no problem if we use only two transmit buffers and let the outgoing packets pile up in the lwIP layer. Nonetheless, I think it is also quite easy to implement zero copy transmit, just not so sure if it is worth the trouble. The main catch is that the STM32 MAC DMA cannot read from ROM (flash), only from SRAM (excluding the CCM area). That means that outgoing pbufs would have to be searched for buffers in "offending" addresses and then allocate some RAM and copy those contents there so DMA can read them.
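The search for "offending" addresses mentioned in the note boils down to a range check per pbuf. Below is a minimal sketch, assuming a single DMA-reachable SRAM window at 0x20000000 of 128 KiB. Both numbers are assumptions that depend on the exact STM32 part, and `dma_can_read` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed DMA-reachable SRAM window; adjust for the actual device. */
#define DMA_SRAM_BASE 0x20000000u
#define DMA_SRAM_SIZE (128u * 1024u)

/* Returns 1 if [addr, addr + len) lies entirely inside the assumed window,
   0 otherwise (e.g. flash or CCM addresses, or a block crossing the end). */
static int dma_can_read(uint32_t addr, size_t len) {
  assert(len > 0);
  if (addr < DMA_SRAM_BASE)
    return 0;
  uint32_t off = addr - DMA_SRAM_BASE;
  return off < DMA_SRAM_SIZE && len <= DMA_SRAM_SIZE - off;
}
```

A transmit path built on this would copy only the pbufs for which the check fails into a bounce buffer in SRAM and pass the rest to the DMA untouched.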

Giovanni
Site Admin
Posts: 13012
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy

Re: STM32 MAC driver optimization

Postby Giovanni » Mon Dec 24, 2012 7:52 pm

Hi,

I agree with most of what you wrote, basically I see things in the same way. I can modify the API of the MAC driver to return descriptors that offer both pairs like you described and read/write accessors like we have now.
A few points:
1) Some Ethernet peripherals do not have buffers but FIFOs, those would:
1.A) Not implement pairs but only accessors.
1.B) Simulate buffers.
2) Large buffers are more efficient for performance, see the AT91 MAC driver and see the mess that is scanning a list of fragments, I decided to not do the same for the STM32, it has a RAM memory large enough.
3) The pairs would be a list of blocks of arbitrary size, we could not guarantee to have buffers of uniform size.

I already planned for changes along those lines but always stopped when trying to integrate with lwIP (it is not exactly the most readable code around, frankly). If you take care of the lwIP integration and the changes to the lwip thread, I could change the MAC API to return the enhanced descriptors.

Giovanni

iggarpe
Posts: 129
Joined: Sun Sep 30, 2012 8:32 pm

Re: STM32 MAC driver optimization

Postby iggarpe » Thu Dec 27, 2012 10:11 pm

First, while reworking the STM32 mac_lld I think I caught a bug:


#if STM32_IP_CHECKSUM_OFFLOAD
        && !(rdes->rdes0 & STM32_RDES0_FT & (STM32_RDES0_IPHCE |
                                             STM32_RDES0_PCE))
#endif


I think that condition is always TRUE because of the wrong use of the bitwise AND. This is how it should be written:

(edited!)


#if STM32_IP_CHECKSUM_OFFLOAD
        && !((rdes->rdes0 & STM32_RDES0_FT) &&
             (rdes->rdes0 & (STM32_RDES0_IPHCE | STM32_RDES0_PCE)))
#endif



Please confirm I did not drink too much coffee today :-)
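The arithmetic behind the claim can be checked in isolation. This is a small sketch, assuming FT, IPHCE and PCE are distinct single-bit masks; the bit positions used here are illustrative and the real values come from the driver header. It only shows that the original expression is constant, not that the edited one matches the hardware semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative single-bit masks; the real values live in the driver header. */
#define FT_BIT    (1u << 5)
#define IPHCE_BIT (1u << 7)
#define PCE_BIT   (1u << 0)

_Static_assert((FT_BIT & (IPHCE_BIT | PCE_BIT)) == 0,
               "the demonstration relies on disjoint masks");

/* Original form: FT & (IPHCE | PCE) is zero for disjoint masks, so the
   whole term is !0 == 1 whatever rdes0 contains. */
static int passes_original(uint32_t rdes0) {
  return !(rdes0 & FT_BIT & (IPHCE_BIT | PCE_BIT));
}

/* Edited form from the post: fails only when FT and an error bit are set. */
static int passes_edited(uint32_t rdes0) {
  return !((rdes0 & FT_BIT) && (rdes0 & (IPHCE_BIT | PCE_BIT)));
}
```

With these masks, a status word with FT and IPHCE both set passes the original test but fails the edited one.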
Last edited by iggarpe on Thu Dec 27, 2012 11:25 pm, edited 1 time in total.

iggarpe
Posts: 129
Joined: Sun Sep 30, 2012 8:32 pm

Re: STM32 MAC driver optimization

Postby iggarpe » Thu Dec 27, 2012 10:32 pm

1) Some Ethernet peripherals do not have buffers but FIFOs, those would:
1.A) Not implement pairs but only accessors.
1.B) Simulate buffers.


Ok, here if I had to make a wish, I'd go with B, because then MAC would only have the "pairs" API. Cleaner. Also, lwipthread would have code only for one scenario: getting pairs from the driver, placing them into pbufs and building the pbuf chain. Otherwise a #define would be needed to choose between that code and the current one that uses an accessor.

However, I realize it would require significant changes in some LLD drivers to internally manage a set of buffers. This is not too complex but also not trivial.

The question is, how many drivers use a FIFO?

2) Large buffers are more efficient for performance, see the AT91 MAC driver and see the mess that is scanning a list of fragments, I decided to not do the same for the STM32, it has a RAM memory large enough.


Well, I didn't go into the details of the AT91 MAC, but it seems like the peripheral is not very smart and leaves holes in the descriptor chain, among other oddities. I think this is not the case with the STM32.

The STM32 MAC won't leave holes, just an incomplete chain (i.e. one with no end descriptor), which is relatively easy to handle. Sure, it means walking the chain in the hope that there is an end descriptor and then finding out there is not, but note this performance penalty is minor and happens only when there are either reception errors or a reception overflow (no free descriptors halfway through a frame reception). In either case you're screwed anyway and the performance penalty is the least of your worries.

There is also a slight performance penalty in having to read N DMA descriptors instead of just one, but I believe this is negligible, especially in light of the huge gains in memory usage in scenarios like mine.
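The chain walk described here can be sketched as follows. The descriptor layout, the bit positions and the name `find_frame_end` are simplified assumptions for illustration, not the driver's own definitions. The walk starts at a first-segment descriptor and either reaches a last-segment descriptor or gives up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative status bits and descriptor layout, not the driver's own. */
#define D_OWN (1u << 31)
#define D_FS  (1u << 9)
#define D_LS  (1u << 8)

typedef struct desc {
  uint32_t     status;
  struct desc *next;      /* ring link */
} desc_t;

/* Walks from 'first' looking for the last segment of a frame.
   Returns the LS descriptor, or NULL if the chain is incomplete. */
static desc_t *find_frame_end(desc_t *first, size_t ring_len) {
  assert(first != NULL);
  if ((first->status & D_OWN) || !(first->status & D_FS))
    return NULL;                  /* not the start of a received frame */
  desc_t *d = first;
  for (size_t i = 0; i < ring_len; i++) {
    if (d->status & D_OWN)
      return NULL;                /* DMA still owns it: no end descriptor */
    if (i > 0 && (d->status & D_FS))
      return NULL;                /* a new frame began: previous one truncated */
    if (d->status & D_LS)
      return d;                   /* complete frame */
    d = d->next;
  }
  return NULL;                    /* walked the whole ring without an LS */
}
```

The wasted work is bounded by the ring length and only occurs on the error paths, which matches the argument that the penalty is minor.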

3) The pairs would be a list of blocks of arbitrary size, we could not guarantee to have buffers of uniform size.


No problem. These pairs are to be enclosed into pbufs in a pbuf chain. Only the pbuf structure is allocated in lwIP (and it is always the same size), and then the pointer/size are filled in. The driver is free to return chunks of any size it sees fit.

P.S: I just realized I haven't expressed yet what an impressive piece of work is ChibiOS. There. I said it :-)

Giovanni
Site Admin
Posts: 13012
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy

Re: STM32 MAC driver optimization

Postby Giovanni » Thu Dec 27, 2012 10:41 pm

I think neither version of that code makes sense; I traced it back to a contributed patch that I didn't examine closely enough.

Not sure how that expression should be modified, the answer is in table 143 of the F4 RM, I will give it a closer look tomorrow.

About the MAC LLD, before you make changes, I think there is an easy way to handle those "pairs" using the existing descriptors. I am thinking of adding a couple of functions that allow scanning the descriptor stream by "buffers" instead of doing read/write-like operations. The required changes to the driver are minimal and incremental. The old API could be maintained or discarded.

Like we now have mac_lld_read_receive_descriptor() there would be a mac_lld_get_next_buffer() returning a pointer to the next buffer and its size (parameters TBD). Probably something similar is possible also for transmit descriptors.

The lwip thread would get the buffers one by one and link them to a pbuf chain (do pbuf chains support pbufs of varying size? I hope so).

Giovanni

iggarpe
Posts: 129
Joined: Sun Sep 30, 2012 8:32 pm

Re: STM32 MAC driver optimization

Postby iggarpe » Thu Dec 27, 2012 10:44 pm

Giovanni,

I just realized the "while" loop below will never perform more than ONE iteration, even when purging, because "rdes" is never updated (though macp->rxptr is), so on the next iteration the OWN bit that the purge has just set ends the loop.

Is this the intended behaviour ?


msg_t mac_lld_get_receive_descriptor(MACDriver *macp,
                                     MACReceiveDescriptor *rdp) {
  stm32_eth_rx_descriptor_t *rdes;

  chSysLock();

  /* Get Current RX descriptor.*/
  rdes = macp->rxptr;

  /* Iterates through received frames until a valid one is found, invalid
     frames are discarded.*/
  while (!(rdes->rdes0 & STM32_RDES0_OWN)) {
    if (!(rdes->rdes0 & (STM32_RDES0_AFM | STM32_RDES0_ES))
#if STM32_IP_CHECKSUM_OFFLOAD
        && !(rdes->rdes0 & STM32_RDES0_FT & (STM32_RDES0_IPHCE |
                                             STM32_RDES0_PCE))
#endif
        && (rdes->rdes0 & STM32_RDES0_FS) && (rdes->rdes0 & STM32_RDES0_LS)) {
      /* Found a valid one.*/
      rdp->offset   = 0;
      rdp->size     = ((rdes->rdes0 & STM32_RDES0_FL_MASK) >> 16) - 4;
      rdp->physdesc = rdes;
      macp->rxptr   = (stm32_eth_rx_descriptor_t *)rdes->rdes3;

      chSysUnlock();
      return RDY_OK;
    }
    /* Invalid frame found, purging.*/
    rdes->rdes0 = STM32_RDES0_OWN;
    macp->rxptr = (stm32_eth_rx_descriptor_t *)rdes->rdes3;
  }

  chSysUnlock();
  return RDY_TIMEOUT;
}
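One possible way to make the loop keep scanning is to advance the local pointer together with the ring pointer. The sketch below uses simplified stand-in types and illustrative bit positions (`rxd_t`, `get_valid_frame`, `D_*` are hypothetical names), omits the locking and the checksum-offload test, and is not the change that was committed to the repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative status bits, not the driver's own definitions. */
#define D_OWN (1u << 31)
#define D_ES  (1u << 15)
#define D_FS  (1u << 9)
#define D_LS  (1u << 8)

typedef struct rxd {
  uint32_t    rdes0;
  struct rxd *next;       /* ring link */
} rxd_t;

/* Scans from *rxptr, purging invalid descriptors, and returns the first
   valid single-descriptor frame or NULL. *rxptr is updated once, after
   the loop, to point past everything that was consumed. */
static rxd_t *get_valid_frame(rxd_t **rxptr) {
  assert(rxptr != NULL && *rxptr != NULL);
  rxd_t *rdes  = *rxptr;
  rxd_t *found = NULL;
  while (!(rdes->rdes0 & D_OWN)) {
    rxd_t *next = rdes->next;
    if (!(rdes->rdes0 & D_ES) &&
        (rdes->rdes0 & D_FS) && (rdes->rdes0 & D_LS)) {
      found = rdes;         /* valid frame: caller releases it later */
      rdes  = next;
      break;
    }
    rdes->rdes0 = D_OWN;    /* purge: give the descriptor back to the DMA */
    rdes = next;            /* the step the quoted loop is missing */
  }
  *rxptr = rdes;
  return found;
}
```

The loop terminates even when every descriptor is invalid, because a full lap ends on a descriptor whose OWN bit was set by the purge.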

Giovanni
Site Admin
Posts: 13012
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy

Re: STM32 MAC driver optimization

Postby Giovanni » Fri Dec 28, 2012 8:41 am

Nope, it should aggressively scan the descriptors, the anomaly is self recovering because the next time the function is invoked the pointer is updated.

Giovanni

Giovanni
Site Admin
Posts: 13012
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy

Re: STM32 MAC driver optimization

Postby Giovanni » Fri Dec 28, 2012 11:44 am

I fixed both problems in the repository and the performance seems to have improved: 12 parallel continuous pings of 1472-byte packets with no interval run without losses (I didn't try more than that), and it feels much smoother.

I also fixed the expression in the CRC checking offload but I am not sure how to test that, forgot how to patch lwIP about that.

In the afternoon I will add the new API for copy-less operations, I have to think a bit about that first.

Giovanni

iggarpe
Posts: 129
Joined: Sun Sep 30, 2012 8:32 pm

Re: STM32 MAC driver optimization

Postby iggarpe » Fri Dec 28, 2012 3:56 pm

Quick note, I think in mac_lld.c line 538 you may as well update macp->rxptr outside of the loop and before chSysUnlock().

Giovanni
Site Admin
Posts: 13012
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy

Re: STM32 MAC driver optimization

Postby Giovanni » Fri Dec 28, 2012 4:05 pm

Hi,

This is the additional API I wish to implement:



#if MAC_SUPPORTS_ZERO_COPY || defined(__DOXYGEN__)
/**
 * @brief   Returns a pointer to the next transmit buffer in the descriptor
 *          chain.
 * @note    The API guarantees that enough buffers can be requested to fill
 *          a whole frame.
 *
 * @param[in] tdp       pointer to a @p MACTransmitDescriptor structure
 * @param[in] size      size of the requested buffer. Specify the frame size
 *                      on the first call then scale the value down subtracting
 *                      the amount of data already copied into the previous
 *                      buffers.
 * @param[out] sizep    pointer to variable receiving the real buffer size.
 *                      The returned value can be less than the amount
 *                      requested, this means that more buffers must be
 *                      requested in order to fill the frame data entirely.
 * @return              Pointer to the returned buffer.
 *
 * @api
 */
#define macGetNextTransmitBuffer(tdp, size, sizep)                          \
  mac_lld_get_next_transmit_buffer(tdp, size, sizep)

/**
 * @brief   Returns a pointer to the next receive buffer in the descriptor
 *          chain.
 * @note    The API guarantees that the descriptor chain contains a whole
 *          frame.
 *
 * @param[in] rdp       pointer to a @p MACReceiveDescriptor structure
 * @param[out] sizep    pointer to variable receiving the buffer size, it is
 *                      zero when the last buffer has already been returned.
 * @return              Pointer to the returned buffer.
 * @retval NULL         if the buffer chain has been entirely scanned.
 *
 * @api
 */
#define macGetNextReceiveBuffer(rdp, sizep)                                 \
  mac_lld_get_next_receive_buffer(rdp, sizep)
#endif /* MAC_SUPPORTS_ZERO_COPY */


Both functions work using the current descriptors mechanism, basically the functions are list iterators. I think this should be sufficient to implement zero copy in both directions (although transmission may turn out not to be possible with lwIP, to be verified).

Thoughts?
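The size/sizep protocol documented for the transmit iterator can be exercised against a mock. In this sketch `mock_tx_desc_t`, `mock_get_next_transmit_buffer` and `fill_frame` are hypothetical names with an assumed 256-byte buffer size; only the calling convention (request the remaining size, receive the real size, subtract, repeat) is taken from the API comment above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_BUF_SIZE  256u
#define MOCK_BUF_COUNT 6u

/* Hypothetical transmit descriptor: a cursor over fixed-size buffers. */
typedef struct {
  uint8_t bufs[MOCK_BUF_COUNT][MOCK_BUF_SIZE];
  size_t  next;
} mock_tx_desc_t;

/* Mock following the documented contract: returns the next buffer and
   stores in *sizep how much of the requested 'size' it can hold. */
static uint8_t *mock_get_next_transmit_buffer(mock_tx_desc_t *tdp,
                                              size_t size, size_t *sizep) {
  assert(tdp->next < MOCK_BUF_COUNT);
  *sizep = size < MOCK_BUF_SIZE ? size : MOCK_BUF_SIZE;
  return tdp->bufs[tdp->next++];
}

/* Fills successive buffers with one frame; returns the buffers used. */
static size_t fill_frame(mock_tx_desc_t *tdp, const uint8_t *frame,
                         size_t len) {
  size_t used = 0;
  while (len > 0) {
    size_t got;
    uint8_t *dst = mock_get_next_transmit_buffer(tdp, len, &got);
    memcpy(dst, frame, got);
    frame += got;           /* advance past the data already placed */
    len   -= got;           /* scale the next request down, as documented */
    used++;
  }
  return used;
}
```

With these assumptions a 600-byte frame occupies three buffers holding 256, 256 and 88 bytes. The receive side would be the mirror image: call the receive iterator until it returns NULL and wrap each returned pointer/size pair in a pbuf.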

PS. I will look into that assignment.

Giovanni

