Hi Alain,

On Wed, Jul 15, 2020 at 8:10 AM Alain Michaud <alainmich...@google.com> wrote:
>
> Resending in plain text.

I've sent a RFC to work out the ordering, that should work out for any
race where it process ACL before Events during a polling interval
(1ms) so I hope that is enough to catch all these races, if that is
not perhaps we could make the interval configurable.

>
> On Wed, Jul 15, 2020 at 9:56 AM Alain Michaud <alainmich...@google.com> wrote:
> >
> > Hi Marcel,
> >
> > Sorry, just got around to this.
> >
> > On Tue, Jun 30, 2020 at 2:55 AM Marcel Holtmann <mar...@holtmann.org> wrote:
> >>
> >> Hi Archie,
> >>
> >> >>> There is a possibility that an ACL packet is received before we
> >> >>> receive the HCI connect event for the corresponding handle. If this
> >> >>> happens, we discard the ACL packet.
> >> >>>
> >> >>> Rather than just ignoring them, this patch provides a queue for
> >> >>> incoming ACL packet without a handle. The queue is processed when
> >> >>> receiving a HCI connection event. If 2 seconds elapsed without
> >> >>> receiving the HCI connection event, assume something bad happened
> >> >>> and discard the queued packet.
> >> >>>
> >> >>> Signed-off-by: Archie Pusaka <apus...@chromium.org>
> >> >>> Reviewed-by: Abhishek Pandit-Subedi <abhishekpan...@chromium.org>
> >> >>
> >> >> so two things up front. I want to hide this behind a 
> >> >> HCI_QUIRK_OUT_OF_ORDER_ACL that a transport driver has to set first. 
> >> >> Frankly if this kind of out-of-order happens on UART or SDIO 
> >> >> transports, then something is obviously going wrong. I have no plan to 
> >> >> fix up after a fully serialized transport.
> >> >>
> >> >> Secondly, if a transport sets HCI_QUIRK_OUT_OF_ORDER_ACL, then I want 
> >> >> this off by default. You can enable it via an experimental setting. The 
> >> >> reason here is that we have to make it really hard and fail as often as 
> >> >> possible so that hardware manufactures and spec writers realize that 
> >> >> something is fundamentally broken here.
> >
> > I don't have any objection to making this explicit enable to non serialized 
> > transports.  However, I do wonder what the intention is around making this 
> > off by default.  We already know there is a race condition between the 
> > interupt and bulk endpoints over USB, so this can and does happen.  
> > Hardware manufaturers can't relly do much about this other than trying to 
> > pull the interupt endpoint more often, but that's only a workaround, it 
> > can't avoid it all together.
> >
> > IMO, this seems like a legitimate fix at the host level and I don't see any 
> > obvious benefits to hide this fix under an experimental feature and make it 
> > more difficult for the customers and system integrators to discover.
> >
> >>
> >> >>
> >> >> I have no problem in running the code and complaining loudly in case 
> >> >> the quirk has been set. Just injecting the packets can only happen if 
> >> >> bluetoothd explicitly enabled it.
> >> >
> >> > Got it.
> >> >
> >> >>
> >> >>
> >> >>>
> >> >>> ---
> >> >>>
> >> >>> include/net/bluetooth/hci_core.h |  8 +++
> >> >>> net/bluetooth/hci_core.c         | 84 +++++++++++++++++++++++++++++---
> >> >>> net/bluetooth/hci_event.c        |  2 +
> >> >>> 3 files changed, 88 insertions(+), 6 deletions(-)
> >> >>>
> >> >>> diff --git a/include/net/bluetooth/hci_core.h 
> >> >>> b/include/net/bluetooth/hci_core.h
> >> >>> index 836dc997ff94..b69ecdd0d15a 100644
> >> >>> --- a/include/net/bluetooth/hci_core.h
> >> >>> +++ b/include/net/bluetooth/hci_core.h
> >> >>> @@ -270,6 +270,9 @@ struct adv_monitor {
> >> >>> /* Default authenticated payload timeout 30s */
> >> >>> #define DEFAULT_AUTH_PAYLOAD_TIMEOUT   0x0bb8
> >> >>>
> >> >>> +/* Time to keep ACL packets without a corresponding handle queued 
> >> >>> (2s) */
> >> >>> +#define PENDING_ACL_TIMEOUT          msecs_to_jiffies(2000)
> >> >>> +
> >> >>
> >> >> Do we have some btmon traces with timestamps. Isn’t a second enough? 
> >> >> Actually 2 seconds is an awful long time.
> >> >
> >> > When this happens in the test lab, the HCI connect event is about
> >> > 0.002 second behind the first ACL packet. We can change this if
> >> > required.
> >> >
> >> >>
> >> >>> struct amp_assoc {
> >> >>>      __u16   len;
> >> >>>      __u16   offset;
> >> >>> @@ -538,6 +541,9 @@ struct hci_dev {
> >> >>>      struct delayed_work     rpa_expired;
> >> >>>      bdaddr_t                rpa;
> >> >>>
> >> >>> +     struct delayed_work     remove_pending_acl;
> >> >>> +     struct sk_buff_head     pending_acl_q;
> >> >>> +
> >> >>
> >> >> can we name this ooo_q and move it to the other queues in this struct. 
> >> >> Unless we want to add a Kconfig option around it, we don’t need to keep 
> >> >> it here.
> >> >
> >> > Ack.
> >> >
> >> >>
> >> >>> #if IS_ENABLED(CONFIG_BT_LEDS)
> >> >>>      struct led_trigger      *power_led;
> >> >>> #endif
> >> >>> @@ -1773,6 +1779,8 @@ void hci_le_start_enc(struct hci_conn *conn, 
> >> >>> __le16 ediv, __le64 rand,
> >> >>> void hci_copy_identity_address(struct hci_dev *hdev, bdaddr_t *bdaddr,
> >> >>>                             u8 *bdaddr_type);
> >> >>>
> >> >>> +void hci_process_pending_acl(struct hci_dev *hdev, struct hci_conn 
> >> >>> *conn);
> >> >>> +
> >> >>> #define SCO_AIRMODE_MASK       0x0003
> >> >>> #define SCO_AIRMODE_CVSD       0x0000
> >> >>> #define SCO_AIRMODE_TRANSP     0x0003
> >> >>> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
> >> >>> index 7959b851cc63..30780242c267 100644
> >> >>> --- a/net/bluetooth/hci_core.c
> >> >>> +++ b/net/bluetooth/hci_core.c
> >> >>> @@ -1786,6 +1786,7 @@ int hci_dev_do_close(struct hci_dev *hdev)
> >> >>>      skb_queue_purge(&hdev->rx_q);
> >> >>>      skb_queue_purge(&hdev->cmd_q);
> >> >>>      skb_queue_purge(&hdev->raw_q);
> >> >>> +     skb_queue_purge(&hdev->pending_acl_q);
> >> >>>
> >> >>>      /* Drop last sent command */
> >> >>>      if (hdev->sent_cmd) {
> >> >>> @@ -3518,6 +3519,78 @@ static int hci_suspend_notifier(struct 
> >> >>> notifier_block *nb, unsigned long action,
> >> >>>      return NOTIFY_STOP;
> >> >>> }
> >> >>>
> >> >>> +static void hci_add_pending_acl(struct hci_dev *hdev, struct sk_buff 
> >> >>> *skb)
> >> >>> +{
> >> >>> +     skb_queue_tail(&hdev->pending_acl_q, skb);
> >> >>> +
> >> >>> +     queue_delayed_work(hdev->workqueue, &hdev->remove_pending_acl,
> >> >>> +                        PENDING_ACL_TIMEOUT);
> >> >>> +}
> >> >>> +
> >> >>> +void hci_process_pending_acl(struct hci_dev *hdev, struct hci_conn 
> >> >>> *conn)
> >> >>> +{
> >> >>> +     struct sk_buff *skb, *tmp;
> >> >>> +     struct hci_acl_hdr *hdr;
> >> >>> +     u16 handle, flags;
> >> >>> +     bool reset_timer = false;
> >> >>> +
> >> >>> +     skb_queue_walk_safe(&hdev->pending_acl_q, skb, tmp) {
> >> >>> +             hdr = (struct hci_acl_hdr *)skb->data;
> >> >>> +             handle = __le16_to_cpu(hdr->handle);
> >> >>> +             flags  = hci_flags(handle);
> >> >>> +             handle = hci_handle(handle);
> >> >>> +
> >> >>> +             if (handle != conn->handle)
> >> >>> +                     continue;
> >> >>> +
> >> >>> +             __skb_unlink(skb, &hdev->pending_acl_q);
> >> >>> +             skb_pull(skb, HCI_ACL_HDR_SIZE);
> >> >>> +
> >> >>> +             l2cap_recv_acldata(conn, skb, flags);
> >> >>> +             reset_timer = true;
> >> >>> +     }
> >> >>> +
> >> >>> +     if (reset_timer)
> >> >>> +             mod_delayed_work(hdev->workqueue, 
> >> >>> &hdev->remove_pending_acl,
> >> >>> +                              PENDING_ACL_TIMEOUT);
> >> >>> +}
> >> >>> +
> >> >>> +/* Remove the oldest pending ACL, and all pending ACLs with the same 
> >> >>> handle */
> >> >>> +static void hci_remove_pending_acl(struct work_struct *work)
> >> >>> +{
> >> >>> +     struct hci_dev *hdev;
> >> >>> +     struct sk_buff *skb, *tmp;
> >> >>> +     struct hci_acl_hdr *hdr;
> >> >>> +     u16 handle, oldest_handle;
> >> >>> +
> >> >>> +     hdev = container_of(work, struct hci_dev, 
> >> >>> remove_pending_acl.work);
> >> >>> +     skb = skb_dequeue(&hdev->pending_acl_q);
> >> >>> +
> >> >>> +     if (!skb)
> >> >>> +             return;
> >> >>> +
> >> >>> +     hdr = (struct hci_acl_hdr *)skb->data;
> >> >>> +     oldest_handle = hci_handle(__le16_to_cpu(hdr->handle));
> >> >>> +     kfree_skb(skb);
> >> >>> +
> >> >>> +     bt_dev_err(hdev, "ACL packet for unknown connection handle %d",
> >> >>> +                oldest_handle);
> >> >>> +
> >> >>> +     skb_queue_walk_safe(&hdev->pending_acl_q, skb, tmp) {
> >> >>> +             hdr = (struct hci_acl_hdr *)skb->data;
> >> >>> +             handle = hci_handle(__le16_to_cpu(hdr->handle));
> >> >>> +
> >> >>> +             if (handle == oldest_handle) {
> >> >>> +                     __skb_unlink(skb, &hdev->pending_acl_q);
> >> >>> +                     kfree_skb(skb);
> >> >>> +             }
> >> >>> +     }
> >> >>> +
> >> >>> +     if (!skb_queue_empty(&hdev->pending_acl_q))
> >> >>> +             queue_delayed_work(hdev->workqueue, 
> >> >>> &hdev->remove_pending_acl,
> >> >>> +                                PENDING_ACL_TIMEOUT);
> >> >>> +}
> >> >>> +
> >> >>
> >> >> So I am wondering if we make this too complicated. Since generally 
> >> >> speaking we can only have a single HCI connect complete anyway at a 
> >> >> time. No matter if the controller serializes it for us or we do it for 
> >> >> the controller. So hci_conn_add could just process the queue for 
> >> >> packets with its handle and then flush it. And it can flush it no 
> >> >> matter what since whatever other packets are in the queue, they can not 
> >> >> be valid.
> >> >>
> >> >> That said, we wouldn’t even need to check the packet handles at all. We 
> >> >> just needed to flag them as already out-of-order queued once and hand 
> >> >> them back into the rx_q at the top. Then the would be processed as 
> >> >> usual. Already ooo packets would cause the same error as before if it 
> >> >> is for a non-existing handle and others would end up being processed.
> >> >>
> >> >> For me this means we just need another queue to park the packets until 
> >> >> hci_conn_add gets called. I might have missed something, but I am 
> >> >> looking for the least invasive option for this and least code 
> >> >> duplication.
> >> >
> >> > I'm not aware of the fact that we can only have a single HCI connect
> >> > complete event at any time. Is this also true even if two / more
> >> > peripherals connect at the same time?
> >> > I was under the impression that if we have device A and B both are
> >> > connecting to us at the same time, we might receive the packets in
> >> > this order:
> >> > (1) ACL A
> >> > (2) ACL B
> >> > (3) HCI conn evt B
> >> > (4) HCI conn evt A
> >> > Hence the queue and the handle check.
> >>
> >> my reading from the LL state machine is that once the first LL_Connect_Req 
> >> is processes, the controller moves out of the advertising state. So no 
> >> other LL_Connect_Req can be processed. So that means that connection 
> >> attempts are serialized.
> >>
> >> Now if you run AE and multiple instances, that might be different, but 
> >> then again, these instances are also offset in time and so I don’t see how 
> >> we can get more than one HCI_Connection_Complete event at a time (and with 
> >> that a leading ACL packet).
> >>
> >> Regards
> >>
> >> Marcel
> >>
> >> --
> >> You received this message because you are subscribed to the Google Groups 
> >> "ChromeOS Bluetooth Upstreaming" group.
> >> To unsubscribe from this group and stop receiving emails from it, send an 
> >> email to chromeos-bluetooth-upstreaming+unsubscr...@chromium.org.
> >> To post to this group, send email to 
> >> chromeos-bluetooth-upstream...@chromium.org.
> >> To view this discussion on the web visit 
> >> https://groups.google.com/a/chromium.org/d/msgid/chromeos-bluetooth-upstreaming/7BBB55E0-FBD9-40C0-80D9-D5E7FC9F80D2%40holtmann.org.



-- 
Luiz Augusto von Dentz

Reply via email to