LWIP DHCP problem.

ChibiOS public support forum for topics related to the STMicroelectronics STM32 family of micro-controllers.

Moderators: RoccoMarco, barthess

rew
Posts: 380
Joined: Sat Jul 19, 2014 12:59 pm
Has thanked: 2 times
Been thanked: 13 times

LWIP DHCP problem.

Postby rew » Thu Aug 27, 2020 10:04 am

I've built a board that has a PHY and ethernet connector. The CPU is STM32F746.

I've set the lwip configuration to do DHCP

Now when the board boots, there seems to be an 80% chance of none of the DHCP packets making it onto the wire.... And in 20% of the cases I see a neat "DHCP DISCOVERY", OFFER, REQUEST, ACK and then the IP is assigned. When it doesn't work, LWIP seems to retry after 4 or 5 minutes. And again it sometimes works. I observed it today getting an IP after 10 minutes.

I have patched the lwipthread source file to make a copy of the packet that is being sent. Those look ok. Same for incoming packets.

When stuff isn't working I see normal broadcasts on my network coming in. The RX path is working.

When it isn't working, after some 4 minutes it decides to assign an 169.254.x.y IP address. So at that moment it will do a bunch of gratuitous arps for that IP address... Those hit the wire just fine. Also broadcasts just like the DHCP packets. Just the DHCP packets are not making it through.

Anybody seen something like this before? Suggestions on what to try to debug this? I'm not too keen on trying to capture the highspeed signals between the CPU and PHY to see if the packets are still present at that point in the pipeline. (the board doesn't have provisions so I'd be soldering to small wires or chip-pins possibly resulting in shorts etc etc.)

Observation: It works almost all of the time the first boot after reflashing the firmware. To rule out UDP checksums being wrong: I'm now zeroing out the UDP checksum for DHCP packets. Same thing: Worked first time after flashing that firmware and then not anymore.

psyco
Posts: 21
Joined: Fri May 22, 2020 1:40 am
Been thanked: 11 times

Re: LWIP DHCP problem.

Postby psyco » Fri Aug 28, 2020 1:13 am

Actually I have seen this before, and it slipped my mind on reporting it.

In my scenario, I was immediately calling lwipInit after chSysInit and it would delay DHCP by around a minute (might have been more, this was a while ago). Try adding a sleep between chSysInit and lwipInit (10 ms should do). If you're hitting the same thing I did, this should "fix" it.

The cause I was narrowing down to was a time jump in sys_now (lwip_bindings/arch/sys_arch.c) during initialization. With 10kHz systick and immediately calling lwipInit, sys_now would initially return 429,496,731 (((u32_t) 0 - 1) / 10 + 1) when chVTGetSystemTimeX returns 0. I'm not sure of the point of subtracting 1 from the systick and adding 1 as a millisecond. After initialization, the next calls would return 0, then 1, etc. I never got further than that in diagnosing this.

In my extra paranoia of LwIP potentially acting weird with time jumps past initialization, I implemented a critical section in sys_now that basically delta'd chVTGetSystemTimeX, tracked a mod of the number of systicks per ms, and a variable for the total ms. Not entirely sure if that is necessary, though.

rew
Posts: 380
Joined: Sat Jul 19, 2014 12:59 pm
Has thanked: 2 times
Been thanked: 13 times

Re: LWIP DHCP problem.

Postby rew » Fri Aug 28, 2020 8:22 am

Oh! But that's a serious problem!

That "jump" as you call it, is a simple "wraparound".

This means that this same bug will happen after 4.95 days and 9.9 days and.... Now, it is unlikely that a DHCP sequence is initiated precisely at that time, but if this happens in the DHCP code, isn't it likely that it also happens in the TCP code? And "ongoing TCP connections" are much more likely than a DHCP sequence!

With wraparounds, if you do things "just right" it will "just work".

So if you set time_to_do_something = now()+100; and then do if (now() < time_to_do_something ) then things go wrong: When this code happens to run at "wraparound_time - 98", then "time_to_do_something" will become "1" (*). And because it looks as if the "time_to_do_something" is smaller than "now()" it looks as if now has progressed pas time_to_do_something, so something will happen too early. It is also possible to get a "wait-for-the-pigs-to-fly" situation.

But "just right":

Code: Select all

start_time = now ();
[...]
dt = now()-start_time;
if (dt > 100) ....
even in the face of wraparounds, the dt will be smaller than 100 even when there is a wraparound!

Now I must say that I have known this for quite a while, but last year I've had to fly out to some site to fix a bug of this type. Getting it right is definitively NOT trivial! Testing is required. IIRC, the Linux kernel has "handy macros" for this stuff, so that everybody tends to use the handy macros instead of writing hard-to-verify home-made-code each time....


(*) or 2 depending on your definition of the "wraparound time"

rew
Posts: 380
Joined: Sat Jul 19, 2014 12:59 pm
Has thanked: 2 times
Been thanked: 13 times

Re: LWIP DHCP problem.

Postby rew » Fri Aug 28, 2020 1:09 pm

OK. I tested your suggestion: That was not it. :-(

I've added a delay of 100ms and 1000ms but: no dice. Twice it worked first time after flashing new firmware: "Party!" but alas, then a second boot after reset does not work like before. :-(
I have now added code that I can configure my system to not do "lwipinit" at boot at all. I can then type dhcp init at my leisure and trigger the lwip_init when I feel like it. Again, no dice!

psyco
Posts: 21
Joined: Fri May 22, 2020 1:40 am
Been thanked: 11 times

Re: LWIP DHCP problem.

Postby psyco » Sat Aug 29, 2020 1:30 am

I hesitated calling it a wraparound since it is discontinuous. When it wraps, LwIP sees a jump of 44 days worth of time. In your example, with start_time being 429,496,731, and the second now () being 0, dt ends up being 3,865,470,565.

Once it gets an IP, is TCP communication good? If so, maybe a race between the link up and the sending of the initial DHCP probe? Do you have the cache invalidation sorted out (or DMA memory not cached)?

Not sure what else. I haven't had any issues like that with an F76x chip and LwIP.

rew
Posts: 380
Joined: Sat Jul 19, 2014 12:59 pm
Has thanked: 2 times
Been thanked: 13 times

Re: LWIP DHCP problem.

Postby rew » Sat Aug 29, 2020 9:55 pm

If you do the wraparounds well, the dt will be negative when appropriate.

Say lets do this with bytes.

Is 40+10 > 45 ? Yes, 50 > 45
is 250+10 > 255 ? well... no, in byte 250+10 = 4

but when we do: 40+10-45 = 5, positive: yes, bigger.
250+10-255 = 5, positive: yes bigger.

My project doesn't need TCP, so I haven't tested with TCP. But everything UDP works.
Also: It is not just the first DHCP packet that gets lost. I see the retransmit a few seconds later: lwip submits it to the routine in lwipthread.c that is meant to pass it on to the mac driver and there I get to take a copy of it and increment the "number of packets sent" counter. All that happens.

A race would possibly mess with the very first probe, but not with the repeat 2 seconds later...

Hmm. The DMA/cache is a good remark. I have to look into that again and see what needs to be done. I would really prefer that such code was in the device drivers, but you're right. I need to modify chibios sources to make it work on "with-cache" CPUs.

Still odd that I was able to modify 1 million packets and verify the modifications when they come back....

rew
Posts: 380
Joined: Sat Jul 19, 2014 12:59 pm
Has thanked: 2 times
Been thanked: 13 times

Re: LWIP DHCP problem.

Postby rew » Mon Aug 31, 2020 1:17 pm

@psyco

You're my hero! Thanks very much for helping. you found it!

Adding:

Code: Select all

cacheBufferFlush (tdp->physdesc->tdes2, tdp->offset);

to line 469 in ./hal/ports/STM32/LLD/MACv1/hal_mac_lld.c in mac_lld_release_transmit_descriptor fixes it !

Giovanni, can I ask a question: I don't see a good way to do this in my application. Shouldn't this be in the driver? I don't think there is a big performance hit if you make a "flush the cache if this cpu has a cache" define that evaluates to nothing when there is no cache.

In 2015 you wrote: viewtopic.php?f=3&t=2767
Leave it to the application [...]
Also imagine the support nightmare, most people is not able to debug this kind of stuff.

and I fully agree, but this seems to be the approach taken nowadays. Lets just say that this cost me lots of time to find.....

In this case, the abstraction of the driver hides the buffer from the application: the "application" (in this case os/various/lwipthread.c) does not have access to the DMA buffer, but just submits chunks of data to transmit that get memcpy-ed to the DMA buffer and eventually "closes" the packet.

User avatar
Giovanni
Site Admin
Posts: 14457
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1076 times
Been thanked: 922 times
Contact:

Re: LWIP DHCP problem.

Postby Giovanni » Mon Aug 31, 2020 1:28 pm

It is best to place all the ETH ram buffers and descriptors in non-cacheable memory, handling the cache manually can leave hard to debug issues.

Giovanni

rew
Posts: 380
Joined: Sat Jul 19, 2014 12:59 pm
Has thanked: 2 times
Been thanked: 13 times

Re: LWIP DHCP problem.

Postby rew » Mon Aug 31, 2020 2:08 pm

My application (the one that calls "lwip_init" with the options struct to request DHCP) does NOTHING with buffers. It is the "mac driver in chibios that allocates a buffer for the packet when macWaitTransmitDescriptor is called from os/various/lwip/lwipthread.c (i.e. also chibios code) .

mikeprotts
Posts: 166
Joined: Wed Jan 09, 2019 12:37 pm
Has thanked: 19 times
Been thanked: 31 times

Re: LWIP DHCP problem.

Postby mikeprotts » Wed Sep 02, 2020 4:01 pm

The definition of the RAM areas needs to be correct, are you using the STM32F746xG_ETH.ld or STM32F746xG_MAX.ld linker script?

Mike


Return to “STM32 Support”

Who is online

Users browsing this forum: No registered users and 50 guests