USB Serial driver is misbehaving?

Discussions and support about ChibiOS/HAL, the MCU Hardware Abstraction Layer.
User avatar
Spym
Posts: 45
Joined: Tue Oct 15, 2013 10:20 am
Location: Tallinn, Estonia
Has thanked: 2 times
Been thanked: 3 times
Contact:

USB Serial driver is misbehaving?

Postby Spym » Wed Mar 21, 2018 1:52 pm

Greetings all, Giovanni,

We were about to ship our product to customers when it turned out that we have a sporadic glitch in the USB stack. I am a bit lost at the moment, so any advice would be highly appreciated.

Let me first describe what I'm dealing with and what happened. The application is based on STM32F446 with ChibiOS 16.x, and it uses the USB CDC ACM driver provided with ChibiOS. The USB CDC ACM stack is configured according to the examples distributed with ChibiOS, with only minor deviations; it used to work great with all meaningful scenarios under different host OS. The buffers for the driver were configured as follows until a very recent, very last minute change to the firmware:

Code: Select all

#define SERIAL_USB_BUFFERS_SIZE     4096
#define SERIAL_USB_BUFFERS_NUMBER   2

The buffers have to be large because we move a lot of data between the device and the host. However, the last-minute change I mentioned forced me to reduce the buffers because I ran out of RAM, so the new configuration was:

Code: Select all

#define SERIAL_USB_BUFFERS_SIZE     2048
#define SERIAL_USB_BUFFERS_NUMBER   2


Everything still worked properly, until I connected the device to a virtual machine. The virtual machine is managed by Virtual Box, the host OS is Linux Mint 18 Sarah (based on Ubuntu 16.04 Xenial, kernel 4.4.0), and the guest OS is Ubuntu 16.04 Xenial (kernel 4.13.0). I started a certain application on the guest OS which started to communicate with the device over USB, and shortly after that the firmware triggered an assertion failure here (marked):

Code: Select all

void ibqPostFullBufferI(input_buffers_queue_t *ibqp, size_t size) {

  osalDbgCheckClassI();

  osalDbgCheck((size > 0U) && (size <= (ibqp->bsize - sizeof (size_t))));   // <--- Here, size == 0
  osalDbgAssert(!ibqIsFullI(ibqp), "buffers queue full");

  /* Writing size field in the buffer.*/
  *((size_t *)ibqp->bwrptr) = size;

  /* Posting the buffer in the queue.*/
  ibqp->bcounter++;
  ibqp->bwrptr += ibqp->bsize;
  if (ibqp->bwrptr >= ibqp->btop) {
    ibqp->bwrptr = ibqp->buffers;
  }

  /* Waking up one waiting thread, if any.*/
  osalThreadDequeueNextI(&ibqp->waiting, MSG_OK);
}

...which is invoked from:

Code: Select all

void sduDataReceived(USBDriver *usbp, usbep_t ep) {
  SerialUSBDriver *sdup = usbp->out_params[ep - 1U];
  if (sdup == NULL) {
    return;
  }

  osalSysLockFromISR();

  /* Signaling that data is available in the input queue.*/
  chnAddFlagsI(sdup, CHN_INPUT_AVAILABLE);

  /* Posting the filled buffer in the queue.*/
  ibqPostFullBufferI(&sdup->ibqueue,
                     usbGetReceiveTransactionSizeX(sdup->config->usbp,  // <-- Here, usbGetReceiveTransactionSizeX() returns 0
                                                   sdup->config->bulk_out));

  /* The endpoint cannot be busy, we are in the context of the callback,
     so a packet is in the buffer for sure. Trying to get a free buffer
     for the next transaction.*/
  sdu_start_receive(sdup);

  osalSysUnlockFromISR();
}


Here is the full context in the form of a GDB log (lots of text):

Code: Select all

Reading symbols from build/compound.elf...done.
Target voltage: 4.3V
Available Targets:
No. Att Driver
 1      STM32F4xx
Error while running hook_stop:
No symbol "disassembly" in current context.
0x0800db4c in chSysHalt (reason=reason@entry=0x8049c10 <__func__.8934> "ibqPostFullBufferI") at chibios/os/rt/src/chsys.c:203
203       CH_CFG_SYSTEM_HALT_HOOK(reason);
Catching vectors: reset

gdb$ set print pretty

gdb$ backtrace
#0  0x0800db4c in chSysHalt (reason=reason@entry=0x8049c10 <__func__.8934> "ibqPostFullBufferI") at chibios/os/rt/src/chsys.c:203
#1  0x0800f21a in ibqPostFullBufferI (ibqp=ibqp@entry=0x200099d0 <board::usb::(anonymous namespace)::SDU1+12>, size=0x0) at chibios/os/hal/src/hal_buffers.c:152
#2  0x0801034a in sduDataReceived (usbp=<optimized out>, ep=<optimized out>) at chibios/os/hal/src/hal_serial_usb.c:467
#3  0x080118ac in otg_epout_handler (usbp=usbp@entry=0x20003938 <USBD1>, ep=ep@entry=0x2) at chibios/os/hal/ports/STM32/LLD/OTGv1/hal_usb_lld.c:443
#4  0x08011dbc in usb_lld_serve_interrupt (usbp=0x20003938 <USBD1>) at chibios/os/hal/ports/STM32/LLD/OTGv1/hal_usb_lld.c:625
#5  Vector14C () at chibios/os/hal/ports/STM32/LLD/OTGv1/hal_usb_lld.c:651
#6  <signal handler called>
#7  0x55555554 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

gdb$ up 2
#2  0x0801034a in sduDataReceived (usbp=<optimized out>, ep=<optimized out>) at chibios/os/hal/src/hal_serial_usb.c:467
467       ibqPostFullBufferI(&sdup->ibqueue,

gdb$ info locals
sdup = 0x200099c4 <board::usb::(anonymous namespace)::SDU1>

gdb$ print ((sdup->config->usbp)->epc[sdup->config->bulk_out]->out_state->rxcnt)
$1 = 0x0

gdb$ print *sdup
$2 = {
  vmt = 0x8049db0 <vmt>,
  event = {
    next = 0x2000e500 <serial::init(std::function<bool (char const*)> const&)::serial_thread+8704>
  },
  state = SDU_READY,
  ibqueue = {
    waiting = {
      next = 0x200099d0 <board::usb::(anonymous namespace)::SDU1+12>,
      prev = 0x200099d0 <board::usb::(anonymous namespace)::SDU1+12>
    },
    suspended = 0x0,
    bcounter = 0x0,
    bwrptr = 0x20009a40 <board::usb::(anonymous namespace)::SDU1+124> "\033",
    brdptr = 0x20009a40 <board::usb::(anonymous namespace)::SDU1+124> "\033",
    btop = 0x20009ac8 <board::usb::(anonymous namespace)::SDU1+260> "@",
    bsize = 0x44,
    bn = 0x2,
    buffers = 0x20009a40 <board::usb::(anonymous namespace)::SDU1+124> "\033",
    ptr = 0x0,
    top = 0x20009a8f <board::usb::(anonymous namespace)::SDU1+203> ".l.curr_mul",
    notify = 0x8010001 <ibnotify>,
    link = 0x200099c4 <board::usb::(anonymous namespace)::SDU1>
  },
  obqueue = {
    waiting = {
      next = 0x20009a08 <board::usb::(anonymous namespace)::SDU1+68>,
      prev = 0x20009a08 <board::usb::(anonymous namespace)::SDU1+68>
    },
    suspended = 0x0,
    bcounter = 0x1,
    bwrptr = 0x20009b0c <board::usb::(anonymous namespace)::SDU1+328> "@",
    brdptr = 0x20009ac8 <board::usb::(anonymous namespace)::SDU1+260> "@",
    btop = 0x20009b50 <board::usb::(anonymous namespace)::SDU1+396> "",
    bsize = 0x44,
    bn = 0x2,
    buffers = 0x20009ac8 <board::usb::(anonymous namespace)::SDU1+260> "@",
    ptr = 0x0,
    top = 0x20009b0c <board::usb::(anonymous namespace)::SDU1+328> "@",
    notify = 0x800ff61 <obnotify>,
    link = 0x200099c4 <board::usb::(anonymous namespace)::SDU1>
  },
  ib = "\033\000\000\000\216\001\000\020mid.phi.curr_mul\000\377K8xr\216\377\222\226_\365\216\b\216\020Ɏty\000\377\343Wv,\216", '\000' <repeats 17 times>, "\a\000\000\000\216\003\245\240-A\216.l.curr_mul\000\377\363\322\060\001\216.\216\000\377&\212\254\060\216on\000\377\214\307\327\215\216\216v,\216", '\000' <repeats 16 times>,
  ob = "@", '\000' <repeats 27 times>, "\002", '\000' <repeats 39 times>, "@\000\000\000\000ˌ\316\357\000\000\000\000\064\326\060 \026\000\000\000\003", '\000' <repeats 22 times>, "\001", '\000' <repeats 22 times>,
  config = 0x804cb00 <board::usb::(anonymous namespace)::serusbcfg1>
}

gdb$ print *(sdup->config->usbp)
$3 = {
  state = USB_ACTIVE,
  config = 0x804cc00 <board::usb::(anonymous namespace)::usbcfg>,
  transmitting = 0x4,
  receiving = 0x0,
  epc = {0x8049f20 <ep0config>, 0x804cc10 <board::usb::(anonymous namespace)::ep1config>, 0x804cc40 <board::usb::(anonymous namespace)::ep2config>, 0x0, 0x0, 0x0, 0x0},
  in_params = {0x200099c4 <board::usb::(anonymous namespace)::SDU1>, 0x200099c4 <board::usb::(anonymous namespace)::SDU1>, 0x0, 0x0, 0x0, 0x0},
  out_params = {0x0, 0x200099c4 <board::usb::(anonymous namespace)::SDU1>, 0x0, 0x0, 0x0, 0x0},
  ep0state = USB_EP0_WAITING_SETUP,
  ep0next = 0x20002100 <linecoding> "",
  ep0n = 0x7,
  ep0endcb = 0x0,
  setup = "! \000\000\000\000\a",
  status = 0x0,
  address = 0x30,
  configuration = 0x1,
  saved_state = USB_READY,
  otg = 0x50000000,
  otgparams = 0x8049f50 <fsparams>,
  pmnext = 0xa4,
  txpending = 0x4,
  wait = 0x0,
  tr = 0x20004b98 <USBD1+4704>,
  wa_pump = {0x5555555555555555 <repeats 534 times>, 0x20004b982000dbd8, 0x2000dad4200039c8, 0x800e191ffffffff, 0x810002000800c3ae, 0x5555555555555555, 0x5555555555555555, 0x2000387055555555, 0x10000000800c3e6, 0x5555555555555555, 0x5555555555555555, 0x800e35f55555555, 0x800da91, 0x17d4bce00000000, 0x5555555500000003, 0x5555555555555555, 0x5555555555555555, 0x5555555500000000, 0x5000000020003908, 0xc6000200060000, 0x800e96900000000, 0x810000000801117c, 0x5555555555555555, 0x5555555555555555, 0x2000387055555555, 0x5555555520003820, 0x5555555555555555, 0x5555555555555555, 0x800e35f55555555, 0x800da91, 0x801120100000000, 0x5000000000000008, 0x80110d155555555, 0x5555555520003938, 0x5555555555555555, 0x5555555555555555, 0x800c3cb55555555, 0x2000387020003870, 0x20004ae400000094, 0x200038202000dbd8, 0x200039c808049f60, 0x55010001, 0x20004bc0, 0x9400000000, 0x19c80000021f, 0x9eae01194d9, 0x75b278}
}


I did some more experimenting and noticed that the probability of the occurrence of the error varies unpredictably if I change the size of the buffers. Among the tested values, 2x128 bytes seemed to trigger the problem most often; in general, larger buffers seemed to be less likely to trigger the failure.

At this point, I suspected that sometimes the host might be emitting USB bulk data transfers with an empty payload, which may not be managed by the firmware properly, hence why the transfer size was zero in sduDataReceived(). So I decided to start Wireshark and collect two data dumps - one where the stack is known to function correctly, and the other where the failure is triggered. As it turned out, the host never emits empty bulk data transfer requests, so my first assumption has not been confirmed; however, the failure seems to correlate with the following interesting condition on the bus (also see the attached capture file):

Screenshot_20180321_133100.png


The captured fragment immediately precedes the failure. The line highlighted in black is the last response we got from the device, immediately after that it triggered the assertion trip. This segment is interesting because, as you can see from the timestamps, the device took 13 milliseconds (!) to process the write request, and the request-response sequence was interrupted by two other concurrent data-IN requests. A few tests show that the long request-response cycle combined with concurrent data-IN requests seem to reliably predict the failure.

The fact that the problem seems to correlate with concurrent requests suggest that there might be a race condition in the driver, causing usbGetReceiveTransactionSizeX() return incorrect (zero) transfer size?

Additionally, I noticed that the failure seems to be marginally less likely to occur while the USB traffic is monitored and the usbmon driver is loaded. This observation might support the race condition hypothesis unless I'm completely misunderstanding how things work under the hood.

Any help would be appreciated! Let me know if there's any additional data you'd like to have.
Attachments
failing.tar.gz
Wireshark dump
(40.08 KiB) Downloaded 233 times

User avatar
Giovanni
Site Admin
Posts: 14444
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1074 times
Been thanked: 921 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Giovanni » Wed Mar 21, 2018 2:43 pm

Hi,

I can give this some time during weekend but it would be better to have something to test and debug. Are you able to reproduce the problem in a simple test application?

Other things I would try:
- 16.7 is old, could you try 18.2.0? the USB driver changed in 2 years.
- Have you considered to not use the serial_usb emulation but directly the USB API? there are demos about this, you would skip entirely the whole buffering thing.

Giovanni

User avatar
Spym
Posts: 45
Joined: Tue Oct 15, 2013 10:20 am
Location: Tallinn, Estonia
Has thanked: 2 times
Been thanked: 3 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Spym » Wed Mar 21, 2018 2:56 pm

So I just tried the following hack to work-around the problem in the easiest and dirtiest way possible:

Code: Select all

static void sduDataReceivedProxy(USBDriver* const usbp, const usbep_t ep)
{
    if (auto sdup = static_cast<SerialUSBDriver*>(usbp->out_params[ep - 1U]))
    {
        if (usbGetReceiveTransactionSizeX(sdup->config->usbp, sdup->config->bulk_out) > 0)
        {
            sduDataReceived(usbp, ep);
        }
    }
}


The objective was to prevent the firmware from crashing and see how far we can get. As it turns out, the USB stack falls into an unusable state after the first glitch is observed - after the first glitch, the device takes up to 6 (six!) seconds to respond to any data-OUT request:

Screenshot_20180321_155250.png


The stack seems to fully recover after the interface is re-connected. This makes it look like something (perhaps, concurrent requests?) damages the internal state of the driver, and it takes a reinitialization (reconnection) to bring it back to life.

User avatar
Spym
Posts: 45
Joined: Tue Oct 15, 2013 10:20 am
Location: Tallinn, Estonia
Has thanked: 2 times
Been thanked: 3 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Spym » Wed Mar 21, 2018 2:58 pm

Hi Giovanni,

Sorry, didn't see your reply when I wrote the previous post. I think I should be able to put together a simple MWE, but it's going to be configured for my hardware - I hope that's acceptable. If you have any other ideas (I'd rather not update the OS right before the release, unless it's absolutely necessary), please share.

Thank you!

User avatar
Giovanni
Site Admin
Posts: 14444
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1074 times
Been thanked: 921 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Giovanni » Wed Mar 21, 2018 3:32 pm

I need something to debug, could you make it run on a nucleo board?

We don't need the whole thing, just a minimal demonstrator of the issue.

Giovanni

User avatar
Spym
Posts: 45
Joined: Tue Oct 15, 2013 10:20 am
Location: Tallinn, Estonia
Has thanked: 2 times
Been thanked: 3 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Spym » Wed Mar 21, 2018 5:53 pm

Giovanni,

I just rummaged through the shelves in the lab and found no commonly used STM32 demoboards, not even half a Nucleo. However, while in search of a demoboard, I stumbled upon our other successful USB-enabled products based on ChibiOS 16 and various flavors of STM32, namely STM32F373, STM32F446 (same as above), and STM32F105. I decided to reproduce this problem on them as well, and, to my utter horror, I observed exactly the same glitch on all of them. None of the three devices exhibit this error when used with a non-virtualized OS, all of them fail with a guest OS.

Some of the devices are open source; for example, this is the one based on STM32F373: https://github.com/Zubax/zubax_babel.

The evidence I was able to collect up to this point leads me to believe that the error is not specific to my applications. Therefore, you should be able to reproduce it on the demo applications distributed with ChibiOS as well, using whatever hardware you have available at hand. I recommend the following recipe:

1. Boot a virtual machine. In my case, as has been said above, I'm using VirtualBox with Ubuntu 16.04 AMD64. I am currently uploading the exact image I am using here: https://files.zubax.com/vm/vm.ova (at the time of writing this the link returns 404, it should be up soon).
2. Connect your demo application's USB interface (make sure that assertions and system state checks are enabled) to the VM.
3. Generate some traffic in both directions:

Code: Select all

stty -F /dev/ttyACM1 9600
while true; do cat /dev/ttyACM1; done

Code: Select all

cat /dev/urandom > /dev/ttyACM1

4. Sit back and wait a few minutes.

Please let me know how it went.

User avatar
Spym
Posts: 45
Joined: Tue Oct 15, 2013 10:20 am
Location: Tallinn, Estonia
Has thanked: 2 times
Been thanked: 3 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Spym » Wed Mar 21, 2018 7:34 pm

The VM archive mentioned in the previous post is now available: https://files.zubax.com/vm/vm.ova

The workaround that I posted here eariler was incorrect. Here's a fixed edition:

Code: Select all

static void sduDataReceivedProxy(USBDriver* const usbp, const usbep_t ep)
{
    if (auto sdup = static_cast<SerialUSBDriver*>(usbp->out_params[ep - 1U]))
    {
        if (usbGetReceiveTransactionSizeX(sdup->config->usbp, sdup->config->bulk_out) > 0)
        {
            // The transaction is not empty
            sduDataReceived(usbp, ep);
        }
        else
        {
            // The code below mimics sduDataReceived() except that it doesn't post the buffer (it would be empty).
            chibios_rt::System::lockFromIsr();
            if (uint8_t* const buf = ibqGetEmptyBufferI(&sdup->ibqueue))
            {
                usbStartReceiveI(sdup->config->usbp, sdup->config->bulk_out, buf, SERIAL_USB_BUFFERS_SIZE);
            }
            chibios_rt::System::unlockFromIsr();
        }
    }
}


The updated workaround seems to tolerate empty transfers well (not breaking the interface most of the times); however, it is still unclear why we're getting them in the first place.

User avatar
Giovanni
Site Admin
Posts: 14444
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1074 times
Been thanked: 921 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Giovanni » Wed Mar 21, 2018 9:36 pm

I think it is something about zero length packets. Could you test 18.2.0 using one of the provided demos? setting up VMs and all will take some time for me.

Giovanni

User avatar
Giovanni
Site Admin
Posts: 14444
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1074 times
Been thanked: 921 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Giovanni » Sun Mar 25, 2018 9:43 am

Hi,

usbGetReceiveTransactionSizeX() returns the transaction size that is the "cnt" field associated to the endpoint, I don't think it is about "concurrent" transactions because that field is private to the endpoint, transactions on other endpoints should not affect it.

I think it is a zero size packet that the host sends immediately after the OUT transaction has been started, this results in a transaction size of zero which triggers that assertion.

If this is the case then a transaction with size zero should be considered valid and that assertion is wrong, I need to understand how to handle this.

Could you give a try to the demo STM32F7xx-USB_RAW? that does not use buffering and it should not be affected by this. You should be able to port it easily to other devices. If it is unaffected then the driver is OK, the problem should be handled entirely on the buffering side.

Giovanni

User avatar
Giovanni
Site Admin
Posts: 14444
Joined: Wed May 27, 2009 8:48 am
Location: Salerno, Italy
Has thanked: 1074 times
Been thanked: 921 times
Contact:

Re: USB Serial driver is misbehaving?

Postby Giovanni » Sun Mar 25, 2018 12:55 pm

Hi,

Please try this fix in hal_serial_usb.c:

Code: Select all

void sduDataReceived(USBDriver *usbp, usbep_t ep) {
  size_t size;
  SerialUSBDriver *sdup = usbp->out_params[ep - 1U];

  if (sdup == NULL) {
    return;
  }

  osalSysLockFromISR();

  /* Checking for zero-size transactions.*/
  size = usbGetReceiveTransactionSizeX(sdup->config->usbp,
                                       sdup->config->bulk_out);
  if (size > (size_t)0) {
    /* Signaling that data is available in the input queue.*/
    chnAddFlagsI(sdup, CHN_INPUT_AVAILABLE);

    /* Posting the filled buffer in the queue.*/
    ibqPostFullBufferI(&sdup->ibqueue, size);
  }

  /* The endpoint cannot be busy, we are in the context of the callback,
     so a packet is in the buffer for sure. Trying to get a free buffer
     for the next transaction.*/
  (void) sdu_start_receive(sdup);

  osalSysUnlockFromISR();
}


Giovanni


Return to “ChibiOS/HAL”

Who is online

Users browsing this forum: No registered users and 8 guests