We were about to ship our product to customers when it turned out that we have a sporadic glitch in the USB stack. I am a bit lost at the moment, so any advice would be highly appreciated.
Let me first describe what I'm dealing with and what happened. The application is based on STM32F446 with ChibiOS 16.x, and it uses the USB CDC ACM driver provided with ChibiOS. The USB CDC ACM stack is configured according to the examples distributed with ChibiOS, with only minor deviations; it used to work great with all meaningful scenarios under different host OS. The buffers for the driver were configured as follows until a very recent, very last minute change to the firmware:
Code: Select all
#define SERIAL_USB_BUFFERS_SIZE 4096
#define SERIAL_USB_BUFFERS_NUMBER 2
The buffers have to be large because we move a lot of data between the device and the host. However, the last-minute change I mentioned forced me to reduce the buffers because I ran out of RAM, so the new configuration was:
Code: Select all
#define SERIAL_USB_BUFFERS_SIZE 2048
#define SERIAL_USB_BUFFERS_NUMBER 2
Everything still worked properly, until I connected the device to a virtual machine. The virtual machine is managed by Virtual Box, the host OS is Linux Mint 18 Sarah (based on Ubuntu 16.04 Xenial, kernel 4.4.0), and the guest OS is Ubuntu 16.04 Xenial (kernel 4.13.0). I started a certain application on the guest OS which started to communicate with the device over USB, and shortly after that the firmware triggered an assertion failure here (marked):
Code: Select all
void ibqPostFullBufferI(input_buffers_queue_t *ibqp, size_t size) {
osalDbgCheckClassI();
osalDbgCheck((size > 0U) && (size <= (ibqp->bsize - sizeof (size_t)))); // <--- Here, size == 0
osalDbgAssert(!ibqIsFullI(ibqp), "buffers queue full");
/* Writing size field in the buffer.*/
*((size_t *)ibqp->bwrptr) = size;
/* Posting the buffer in the queue.*/
ibqp->bcounter++;
ibqp->bwrptr += ibqp->bsize;
if (ibqp->bwrptr >= ibqp->btop) {
ibqp->bwrptr = ibqp->buffers;
}
/* Waking up one waiting thread, if any.*/
osalThreadDequeueNextI(&ibqp->waiting, MSG_OK);
}
...which is invoked from:
Code: Select all
void sduDataReceived(USBDriver *usbp, usbep_t ep) {
SerialUSBDriver *sdup = usbp->out_params[ep - 1U];
if (sdup == NULL) {
return;
}
osalSysLockFromISR();
/* Signaling that data is available in the input queue.*/
chnAddFlagsI(sdup, CHN_INPUT_AVAILABLE);
/* Posting the filled buffer in the queue.*/
ibqPostFullBufferI(&sdup->ibqueue,
usbGetReceiveTransactionSizeX(sdup->config->usbp, // <-- Here, usbGetReceiveTransactionSizeX() returns 0
sdup->config->bulk_out));
/* The endpoint cannot be busy, we are in the context of the callback,
so a packet is in the buffer for sure. Trying to get a free buffer
for the next transaction.*/
sdu_start_receive(sdup);
osalSysUnlockFromISR();
}
Here is the full context in the form of a GDB log (lots of text):
Code: Select all
Reading symbols from build/compound.elf...done.
Target voltage: 4.3V
Available Targets:
No. Att Driver
1 STM32F4xx
Error while running hook_stop:
No symbol "disassembly" in current context.
0x0800db4c in chSysHalt (reason=reason@entry=0x8049c10 <__func__.8934> "ibqPostFullBufferI") at chibios/os/rt/src/chsys.c:203
203 CH_CFG_SYSTEM_HALT_HOOK(reason);
Catching vectors: reset
gdb$ set print pretty
gdb$ backtrace
#0 0x0800db4c in chSysHalt (reason=reason@entry=0x8049c10 <__func__.8934> "ibqPostFullBufferI") at chibios/os/rt/src/chsys.c:203
#1 0x0800f21a in ibqPostFullBufferI (ibqp=ibqp@entry=0x200099d0 <board::usb::(anonymous namespace)::SDU1+12>, size=0x0) at chibios/os/hal/src/hal_buffers.c:152
#2 0x0801034a in sduDataReceived (usbp=<optimized out>, ep=<optimized out>) at chibios/os/hal/src/hal_serial_usb.c:467
#3 0x080118ac in otg_epout_handler (usbp=usbp@entry=0x20003938 <USBD1>, ep=ep@entry=0x2) at chibios/os/hal/ports/STM32/LLD/OTGv1/hal_usb_lld.c:443
#4 0x08011dbc in usb_lld_serve_interrupt (usbp=0x20003938 <USBD1>) at chibios/os/hal/ports/STM32/LLD/OTGv1/hal_usb_lld.c:625
#5 Vector14C () at chibios/os/hal/ports/STM32/LLD/OTGv1/hal_usb_lld.c:651
#6 <signal handler called>
#7 0x55555554 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
gdb$ up 2
#2 0x0801034a in sduDataReceived (usbp=<optimized out>, ep=<optimized out>) at chibios/os/hal/src/hal_serial_usb.c:467
467 ibqPostFullBufferI(&sdup->ibqueue,
gdb$ info locals
sdup = 0x200099c4 <board::usb::(anonymous namespace)::SDU1>
gdb$ print ((sdup->config->usbp)->epc[sdup->config->bulk_out]->out_state->rxcnt)
$1 = 0x0
gdb$ print *sdup
$2 = {
vmt = 0x8049db0 <vmt>,
event = {
next = 0x2000e500 <serial::init(std::function<bool (char const*)> const&)::serial_thread+8704>
},
state = SDU_READY,
ibqueue = {
waiting = {
next = 0x200099d0 <board::usb::(anonymous namespace)::SDU1+12>,
prev = 0x200099d0 <board::usb::(anonymous namespace)::SDU1+12>
},
suspended = 0x0,
bcounter = 0x0,
bwrptr = 0x20009a40 <board::usb::(anonymous namespace)::SDU1+124> "\033",
brdptr = 0x20009a40 <board::usb::(anonymous namespace)::SDU1+124> "\033",
btop = 0x20009ac8 <board::usb::(anonymous namespace)::SDU1+260> "@",
bsize = 0x44,
bn = 0x2,
buffers = 0x20009a40 <board::usb::(anonymous namespace)::SDU1+124> "\033",
ptr = 0x0,
top = 0x20009a8f <board::usb::(anonymous namespace)::SDU1+203> ".l.curr_mul",
notify = 0x8010001 <ibnotify>,
link = 0x200099c4 <board::usb::(anonymous namespace)::SDU1>
},
obqueue = {
waiting = {
next = 0x20009a08 <board::usb::(anonymous namespace)::SDU1+68>,
prev = 0x20009a08 <board::usb::(anonymous namespace)::SDU1+68>
},
suspended = 0x0,
bcounter = 0x1,
bwrptr = 0x20009b0c <board::usb::(anonymous namespace)::SDU1+328> "@",
brdptr = 0x20009ac8 <board::usb::(anonymous namespace)::SDU1+260> "@",
btop = 0x20009b50 <board::usb::(anonymous namespace)::SDU1+396> "",
bsize = 0x44,
bn = 0x2,
buffers = 0x20009ac8 <board::usb::(anonymous namespace)::SDU1+260> "@",
ptr = 0x0,
top = 0x20009b0c <board::usb::(anonymous namespace)::SDU1+328> "@",
notify = 0x800ff61 <obnotify>,
link = 0x200099c4 <board::usb::(anonymous namespace)::SDU1>
},
ib = "\033\000\000\000\216\001\000\020mid.phi.curr_mul\000\377K8xr\216\377\222\226_\365\216\b\216\020Ɏty\000\377\343Wv,\216", '\000' <repeats 17 times>, "\a\000\000\000\216\003\245\240-A\216.l.curr_mul\000\377\363\322\060\001\216.\216\000\377&\212\254\060\216on\000\377\214\307\327\215\216\216v,\216", '\000' <repeats 16 times>,
ob = "@", '\000' <repeats 27 times>, "\002", '\000' <repeats 39 times>, "@\000\000\000\000ˌ\316\357\000\000\000\000\064\326\060 \026\000\000\000\003", '\000' <repeats 22 times>, "\001", '\000' <repeats 22 times>,
config = 0x804cb00 <board::usb::(anonymous namespace)::serusbcfg1>
}
gdb$ print *(sdup->config->usbp)
$3 = {
state = USB_ACTIVE,
config = 0x804cc00 <board::usb::(anonymous namespace)::usbcfg>,
transmitting = 0x4,
receiving = 0x0,
epc = {0x8049f20 <ep0config>, 0x804cc10 <board::usb::(anonymous namespace)::ep1config>, 0x804cc40 <board::usb::(anonymous namespace)::ep2config>, 0x0, 0x0, 0x0, 0x0},
in_params = {0x200099c4 <board::usb::(anonymous namespace)::SDU1>, 0x200099c4 <board::usb::(anonymous namespace)::SDU1>, 0x0, 0x0, 0x0, 0x0},
out_params = {0x0, 0x200099c4 <board::usb::(anonymous namespace)::SDU1>, 0x0, 0x0, 0x0, 0x0},
ep0state = USB_EP0_WAITING_SETUP,
ep0next = 0x20002100 <linecoding> "",
ep0n = 0x7,
ep0endcb = 0x0,
setup = "! \000\000\000\000\a",
status = 0x0,
address = 0x30,
configuration = 0x1,
saved_state = USB_READY,
otg = 0x50000000,
otgparams = 0x8049f50 <fsparams>,
pmnext = 0xa4,
txpending = 0x4,
wait = 0x0,
tr = 0x20004b98 <USBD1+4704>,
wa_pump = {0x5555555555555555 <repeats 534 times>, 0x20004b982000dbd8, 0x2000dad4200039c8, 0x800e191ffffffff, 0x810002000800c3ae, 0x5555555555555555, 0x5555555555555555, 0x2000387055555555, 0x10000000800c3e6, 0x5555555555555555, 0x5555555555555555, 0x800e35f55555555, 0x800da91, 0x17d4bce00000000, 0x5555555500000003, 0x5555555555555555, 0x5555555555555555, 0x5555555500000000, 0x5000000020003908, 0xc6000200060000, 0x800e96900000000, 0x810000000801117c, 0x5555555555555555, 0x5555555555555555, 0x2000387055555555, 0x5555555520003820, 0x5555555555555555, 0x5555555555555555, 0x800e35f55555555, 0x800da91, 0x801120100000000, 0x5000000000000008, 0x80110d155555555, 0x5555555520003938, 0x5555555555555555, 0x5555555555555555, 0x800c3cb55555555, 0x2000387020003870, 0x20004ae400000094, 0x200038202000dbd8, 0x200039c808049f60, 0x55010001, 0x20004bc0, 0x9400000000, 0x19c80000021f, 0x9eae01194d9, 0x75b278}
}
I did some more experimenting and noticed that the probability of the occurrence of the error varies unpredictably if I change the size of the buffers. Among the tested values, 2x128 bytes seemed to trigger the problem most often; in general, larger buffers seemed to be less likely to trigger the failure.
At this point, I suspected that sometimes the host might be emitting USB bulk data transfers with an empty payload, which may not be managed by the firmware properly, hence why the transfer size was zero in sduDataReceived(). So I decided to start Wireshark and collect two data dumps - one where the stack is known to function correctly, and the other where the failure is triggered. As it turned out, the host never emits empty bulk data transfer requests, so my first assumption has not been confirmed; however, the failure seems to correlate with the following interesting condition on the bus (also see the attached capture file):
The captured fragment immediately precedes the failure. The line highlighted in black is the last response we got from the device, immediately after that it triggered the assertion trip. This segment is interesting because, as you can see from the timestamps, the device took 13 milliseconds (!) to process the write request, and the request-response sequence was interrupted by two other concurrent data-IN requests. A few tests show that the long request-response cycle combined with concurrent data-IN requests seem to reliably predict the failure.
The fact that the problem seems to correlate with concurrent requests suggest that there might be a race condition in the driver, causing usbGetReceiveTransactionSizeX() return incorrect (zero) transfer size?
Additionally, I noticed that the failure seems to be marginally less likely to occur while the USB traffic is monitored and the usbmon driver is loaded. This observation might support the race condition hypothesis unless I'm completely misunderstanding how things work under the hood.
Any help would be appreciated! Let me know if there's any additional data you'd like to have.