Hi Alain, On Wed, Jul 15, 2020 at 8:10 AM Alain Michaud <alainmich...@google.com> wrote: > > Resending in plain text.
I've sent a RFC to work out the ordering, that should work out for any race where it process ACL before Events during a polling interval (1ms) so I hope that is enough to catch all these races, if that is not perhaps we could make the interval configurable. > > On Wed, Jul 15, 2020 at 9:56 AM Alain Michaud <alainmich...@google.com> wrote: > > > > Hi Marcel, > > > > Sorry, just got around to this. > > > > On Tue, Jun 30, 2020 at 2:55 AM Marcel Holtmann <mar...@holtmann.org> wrote: > >> > >> Hi Archie, > >> > >> >>> There is a possibility that an ACL packet is received before we > >> >>> receive the HCI connect event for the corresponding handle. If this > >> >>> happens, we discard the ACL packet. > >> >>> > >> >>> Rather than just ignoring them, this patch provides a queue for > >> >>> incoming ACL packet without a handle. The queue is processed when > >> >>> receiving a HCI connection event. If 2 seconds elapsed without > >> >>> receiving the HCI connection event, assume something bad happened > >> >>> and discard the queued packet. > >> >>> > >> >>> Signed-off-by: Archie Pusaka <apus...@chromium.org> > >> >>> Reviewed-by: Abhishek Pandit-Subedi <abhishekpan...@chromium.org> > >> >> > >> >> so two things up front. I want to hide this behind a > >> >> HCI_QUIRK_OUT_OF_ORDER_ACL that a transport driver has to set first. > >> >> Frankly if this kind of out-of-order happens on UART or SDIO > >> >> transports, then something is obviously going wrong. I have no plan to > >> >> fix up after a fully serialized transport. > >> >> > >> >> Secondly, if a transport sets HCI_QUIRK_OUT_OF_ORDER_ACL, then I want > >> >> this off by default. You can enable it via an experimental setting. The > >> >> reason here is that we have to make it really hard and fail as often as > >> >> possible so that hardware manufactures and spec writers realize that > >> >> something is fundamentally broken here. > > > > I don't have any objection to making this explicit enable to non serialized > > transports. However, I do wonder what the intention is around making this > > off by default. We already know there is a race condition between the > > interupt and bulk endpoints over USB, so this can and does happen. > > Hardware manufaturers can't relly do much about this other than trying to > > pull the interupt endpoint more often, but that's only a workaround, it > > can't avoid it all together. > > > > IMO, this seems like a legitimate fix at the host level and I don't see any > > obvious benefits to hide this fix under an experimental feature and make it > > more difficult for the customers and system integrators to discover. > > > >> > >> >> > >> >> I have no problem in running the code and complaining loudly in case > >> >> the quirk has been set. Just injecting the packets can only happen if > >> >> bluetoothd explicitly enabled it. > >> > > >> > Got it. > >> > > >> >> > >> >> > >> >>> > >> >>> --- > >> >>> > >> >>> include/net/bluetooth/hci_core.h | 8 +++ > >> >>> net/bluetooth/hci_core.c | 84 +++++++++++++++++++++++++++++--- > >> >>> net/bluetooth/hci_event.c | 2 + > >> >>> 3 files changed, 88 insertions(+), 6 deletions(-) > >> >>> > >> >>> diff --git a/include/net/bluetooth/hci_core.h > >> >>> b/include/net/bluetooth/hci_core.h > >> >>> index 836dc997ff94..b69ecdd0d15a 100644 > >> >>> --- a/include/net/bluetooth/hci_core.h > >> >>> +++ b/include/net/bluetooth/hci_core.h > >> >>> @@ -270,6 +270,9 @@ struct adv_monitor { > >> >>> /* Default authenticated payload timeout 30s */ > >> >>> #define DEFAULT_AUTH_PAYLOAD_TIMEOUT 0x0bb8 > >> >>> > >> >>> +/* Time to keep ACL packets without a corresponding handle queued > >> >>> (2s) */ > >> >>> +#define PENDING_ACL_TIMEOUT msecs_to_jiffies(2000) > >> >>> + > >> >> > >> >> Do we have some btmon traces with timestamps. Isn’t a second enough? > >> >> Actually 2 seconds is an awful long time. > >> > > >> > When this happens in the test lab, the HCI connect event is about > >> > 0.002 second behind the first ACL packet. We can change this if > >> > required. > >> > > >> >> > >> >>> struct amp_assoc { > >> >>> __u16 len; > >> >>> __u16 offset; > >> >>> @@ -538,6 +541,9 @@ struct hci_dev { > >> >>> struct delayed_work rpa_expired; > >> >>> bdaddr_t rpa; > >> >>> > >> >>> + struct delayed_work remove_pending_acl; > >> >>> + struct sk_buff_head pending_acl_q; > >> >>> + > >> >> > >> >> can we name this ooo_q and move it to the other queues in this struct. > >> >> Unless we want to add a Kconfig option around it, we don’t need to keep > >> >> it here. > >> > > >> > Ack. > >> > > >> >> > >> >>> #if IS_ENABLED(CONFIG_BT_LEDS) > >> >>> struct led_trigger *power_led; > >> >>> #endif > >> >>> @@ -1773,6 +1779,8 @@ void hci_le_start_enc(struct hci_conn *conn, > >> >>> __le16 ediv, __le64 rand, > >> >>> void hci_copy_identity_address(struct hci_dev *hdev, bdaddr_t *bdaddr, > >> >>> u8 *bdaddr_type); > >> >>> > >> >>> +void hci_process_pending_acl(struct hci_dev *hdev, struct hci_conn > >> >>> *conn); > >> >>> + > >> >>> #define SCO_AIRMODE_MASK 0x0003 > >> >>> #define SCO_AIRMODE_CVSD 0x0000 > >> >>> #define SCO_AIRMODE_TRANSP 0x0003 > >> >>> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c > >> >>> index 7959b851cc63..30780242c267 100644 > >> >>> --- a/net/bluetooth/hci_core.c > >> >>> +++ b/net/bluetooth/hci_core.c > >> >>> @@ -1786,6 +1786,7 @@ int hci_dev_do_close(struct hci_dev *hdev) > >> >>> skb_queue_purge(&hdev->rx_q); > >> >>> skb_queue_purge(&hdev->cmd_q); > >> >>> skb_queue_purge(&hdev->raw_q); > >> >>> + skb_queue_purge(&hdev->pending_acl_q); > >> >>> > >> >>> /* Drop last sent command */ > >> >>> if (hdev->sent_cmd) { > >> >>> @@ -3518,6 +3519,78 @@ static int hci_suspend_notifier(struct > >> >>> notifier_block *nb, unsigned long action, > >> >>> return NOTIFY_STOP; > >> >>> } > >> >>> > >> >>> +static void hci_add_pending_acl(struct hci_dev *hdev, struct sk_buff > >> >>> *skb) > >> >>> +{ > >> >>> + skb_queue_tail(&hdev->pending_acl_q, skb); > >> >>> + > >> >>> + queue_delayed_work(hdev->workqueue, &hdev->remove_pending_acl, > >> >>> + PENDING_ACL_TIMEOUT); > >> >>> +} > >> >>> + > >> >>> +void hci_process_pending_acl(struct hci_dev *hdev, struct hci_conn > >> >>> *conn) > >> >>> +{ > >> >>> + struct sk_buff *skb, *tmp; > >> >>> + struct hci_acl_hdr *hdr; > >> >>> + u16 handle, flags; > >> >>> + bool reset_timer = false; > >> >>> + > >> >>> + skb_queue_walk_safe(&hdev->pending_acl_q, skb, tmp) { > >> >>> + hdr = (struct hci_acl_hdr *)skb->data; > >> >>> + handle = __le16_to_cpu(hdr->handle); > >> >>> + flags = hci_flags(handle); > >> >>> + handle = hci_handle(handle); > >> >>> + > >> >>> + if (handle != conn->handle) > >> >>> + continue; > >> >>> + > >> >>> + __skb_unlink(skb, &hdev->pending_acl_q); > >> >>> + skb_pull(skb, HCI_ACL_HDR_SIZE); > >> >>> + > >> >>> + l2cap_recv_acldata(conn, skb, flags); > >> >>> + reset_timer = true; > >> >>> + } > >> >>> + > >> >>> + if (reset_timer) > >> >>> + mod_delayed_work(hdev->workqueue, > >> >>> &hdev->remove_pending_acl, > >> >>> + PENDING_ACL_TIMEOUT); > >> >>> +} > >> >>> + > >> >>> +/* Remove the oldest pending ACL, and all pending ACLs with the same > >> >>> handle */ > >> >>> +static void hci_remove_pending_acl(struct work_struct *work) > >> >>> +{ > >> >>> + struct hci_dev *hdev; > >> >>> + struct sk_buff *skb, *tmp; > >> >>> + struct hci_acl_hdr *hdr; > >> >>> + u16 handle, oldest_handle; > >> >>> + > >> >>> + hdev = container_of(work, struct hci_dev, > >> >>> remove_pending_acl.work); > >> >>> + skb = skb_dequeue(&hdev->pending_acl_q); > >> >>> + > >> >>> + if (!skb) > >> >>> + return; > >> >>> + > >> >>> + hdr = (struct hci_acl_hdr *)skb->data; > >> >>> + oldest_handle = hci_handle(__le16_to_cpu(hdr->handle)); > >> >>> + kfree_skb(skb); > >> >>> + > >> >>> + bt_dev_err(hdev, "ACL packet for unknown connection handle %d", > >> >>> + oldest_handle); > >> >>> + > >> >>> + skb_queue_walk_safe(&hdev->pending_acl_q, skb, tmp) { > >> >>> + hdr = (struct hci_acl_hdr *)skb->data; > >> >>> + handle = hci_handle(__le16_to_cpu(hdr->handle)); > >> >>> + > >> >>> + if (handle == oldest_handle) { > >> >>> + __skb_unlink(skb, &hdev->pending_acl_q); > >> >>> + kfree_skb(skb); > >> >>> + } > >> >>> + } > >> >>> + > >> >>> + if (!skb_queue_empty(&hdev->pending_acl_q)) > >> >>> + queue_delayed_work(hdev->workqueue, > >> >>> &hdev->remove_pending_acl, > >> >>> + PENDING_ACL_TIMEOUT); > >> >>> +} > >> >>> + > >> >> > >> >> So I am wondering if we make this too complicated. Since generally > >> >> speaking we can only have a single HCI connect complete anyway at a > >> >> time. No matter if the controller serializes it for us or we do it for > >> >> the controller. So hci_conn_add could just process the queue for > >> >> packets with its handle and then flush it. And it can flush it no > >> >> matter what since whatever other packets are in the queue, they can not > >> >> be valid. > >> >> > >> >> That said, we wouldn’t even need to check the packet handles at all. We > >> >> just needed to flag them as already out-of-order queued once and hand > >> >> them back into the rx_q at the top. Then the would be processed as > >> >> usual. Already ooo packets would cause the same error as before if it > >> >> is for a non-existing handle and others would end up being processed. > >> >> > >> >> For me this means we just need another queue to park the packets until > >> >> hci_conn_add gets called. I might have missed something, but I am > >> >> looking for the least invasive option for this and least code > >> >> duplication. > >> > > >> > I'm not aware of the fact that we can only have a single HCI connect > >> > complete event at any time. Is this also true even if two / more > >> > peripherals connect at the same time? > >> > I was under the impression that if we have device A and B both are > >> > connecting to us at the same time, we might receive the packets in > >> > this order: > >> > (1) ACL A > >> > (2) ACL B > >> > (3) HCI conn evt B > >> > (4) HCI conn evt A > >> > Hence the queue and the handle check. > >> > >> my reading from the LL state machine is that once the first LL_Connect_Req > >> is processes, the controller moves out of the advertising state. So no > >> other LL_Connect_Req can be processed. So that means that connection > >> attempts are serialized. > >> > >> Now if you run AE and multiple instances, that might be different, but > >> then again, these instances are also offset in time and so I don’t see how > >> we can get more than one HCI_Connection_Complete event at a time (and with > >> that a leading ACL packet). > >> > >> Regards > >> > >> Marcel > >> > >> -- > >> You received this message because you are subscribed to the Google Groups > >> "ChromeOS Bluetooth Upstreaming" group. > >> To unsubscribe from this group and stop receiving emails from it, send an > >> email to chromeos-bluetooth-upstreaming+unsubscr...@chromium.org. > >> To post to this group, send email to > >> chromeos-bluetooth-upstream...@chromium.org. > >> To view this discussion on the web visit > >> https://groups.google.com/a/chromium.org/d/msgid/chromeos-bluetooth-upstreaming/7BBB55E0-FBD9-40C0-80D9-D5E7FC9F80D2%40holtmann.org. -- Luiz Augusto von Dentz