Hi, On 15.02.2023 16:41, Jeffrey Hugo wrote: > On 2/14/2023 4:08 AM, Jacek Lawrynowicz wrote: >> Hi, > > Thank you for the review. > >> On 06.02.2023 16:41, Jeffrey Hugo wrote: >>> The Qualcomm Cloud AI 100 (AIC100) device is an Artificial Intelligence >>> accelerator PCIe card. It contains a number of components both in the >>> SoC and on the card which facilitate running workloads: >>> >>> QSM: management processor >>> NSPs: workload compute units >>> DMA Bridge: dedicated data mover for the workloads >>> MHI: multiplexed communication channels >>> DDR: workload storage and memory >>> >>> The Linux kernel driver for AIC100 is called "QAIC" and is located in the >>> accel subsystem. >>> >>> Signed-off-by: Jeffrey Hugo <[email protected]> >>> Reviewed-by: Carl Vanderlip <[email protected]> >>> --- >>> Documentation/accel/index.rst | 1 + >>> Documentation/accel/qaic/aic100.rst | 498 >>> ++++++++++++++++++++++++++++++++++++ >>> Documentation/accel/qaic/index.rst | 13 + >>> Documentation/accel/qaic/qaic.rst | 169 ++++++++++++ >>> 4 files changed, 681 insertions(+) >>> create mode 100644 Documentation/accel/qaic/aic100.rst >>> create mode 100644 Documentation/accel/qaic/index.rst >>> create mode 100644 Documentation/accel/qaic/qaic.rst >>> >>> diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst >>> index 2b43c9a..e94a016 100644 >>> --- a/Documentation/accel/index.rst >>> +++ b/Documentation/accel/index.rst >>> @@ -8,6 +8,7 @@ Compute Accelerators >>> :maxdepth: 1 >>> introduction >>> + qaic/index >>> .. only:: subproject and html >>> diff --git a/Documentation/accel/qaic/aic100.rst >>> b/Documentation/accel/qaic/aic100.rst >>> new file mode 100644 >>> index 0000000..773aa54 >>> --- /dev/null >>> +++ b/Documentation/accel/qaic/aic100.rst >>> @@ -0,0 +1,498 @@ >>> +.. SPDX-License-Identifier: GPL-2.0-only >>> + >>> +=============================== >>> + Qualcomm Cloud AI 100 (AIC100) >>> +=============================== >>> + >>> +Overview >>> +======== >>> + >>> +The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - >>> part of >>> +Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC >>> for >>> +the purpose of efficiently running Artificial Intelligence (AI) Deep >>> Learning >>> +inference workloads. They are AI accelerators. >> >> There are multiple double spaces in this document like this one above. > > I presume you are referring to the double space after period. Universally, > that was the recommended style (APA guidebook, etc.) until a little while ago. > Old habits are hard to break. Will scrub. > >> >>> +The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight >>> lanes >>> +(x8). An individual SoC on a card can have up to 16 NSPs for running >>> workloads. >>> +Each SoC has an A53 management CPU. On card, there can be up to 32 GB of >>> DDR. >>> + >>> +Multiple AIC100 cards can be hosted in a single system to scale overall >>> +performance. >>> + >>> +Hardware Description >>> +==================== >>> + >>> +An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of misc >>> +peripherals (PMICs, etc). >>> + >>> +An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe >>> card), >>> +or a Dual M.2 card. Both use PCIe to connect to the host system. >> >> Dual M.2 card? Is it a single PCB with two M.2 connectors? This requires >> custom >> motherboard with x4 lanes from two connectors combined as a single PCIe >> device, right? > > Yes.
There is a specification for this, although it hasn't gotten widespread > adoption. In addition to more lanes, you also get to draw more power. > Single M.2 is around 11W. Dual M.2 is capped at 25W. > > It tends to be a handy form factor for "edge" applications where the physical > size and power draw of a "normal" PCIe slot (what you'd find on a regular ATX > motherboard) is not desirable. > >> >>> +As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/ >>> +DeviceID(DID) combination to uniquely identify itself to the host. AIC100 >>> +uses the standard Qualcomm VID (0x17cb). All AIC100 instances use the same >>> +AIC100 DID (0xa100). >> >> Maybe "SKUs" would fit better here than "instances". > > Sure. > >> >>> +AIC100 does not implement FLR (function level reset). >>> + >>> +AIC100 implements MSI but does not implement MSI-X. AIC100 requires 17 >>> MSIs to >>> +operate (1 for MHI, 16 for the DMA Bridge). >>> + >>> +As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the >>> device >>> +hardware. AIC100 provides 3, 64-bit BARs. >>> + >>> +* The first BAR is 4K in size, and exposes the MHI interface to the host. >>> + >>> +* The second BAR is 2M in size, and exposes the DMA Bridge interface to the >>> + host. >>> + >>> +* The third BAR is variable in size based on an individual AIC100's >>> + configuration, but defaults to 64K. This BAR currently has no purpose. >>> + >>> +From the host perspective, AIC100 has several key hardware components- >> >> Typo in "components-". > > ? > You want "components -"?
I meant "components:" but I guess the "-" is part of the format. >> >>> +* QSM (QAIC Service Manager) >>> +* NSPs (Neural Signal Processor) >>> +* DMA Bridge >>> +* DDR >>> +* MHI (Modem Host Interface) >>> + >>> +QSM >>> +--- >>> + >>> +QAIC Service Manager. This is an ARM A53 CPU that runs the primary >>> +firmware of the card and performs on-card management tasks. It also >>> +communicates with the host via MHI. Each AIC100 has one of >>> +these. >> >> I would put description of MHI at the top because it is referenced by the >> QSM description. > > Sure. > >> >>> +NSP >>> +--- >>> + >>> +Neural Signal Processor. Each AIC100 has up to 16 of these. These are >>> +the processors that run the workloads on AIC100. Each NSP is a Qualcomm >>> Hexagon >>> +(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, >>> but >>> +multiple NSPs may be assigned to a single workload. Since each NSP can >>> only run >>> +one workload, AIC100 is limited to 16 concurrent workloads. Workload >>> +"scheduling" is under the purview of the host. AIC100 does not >>> automatically >>> +timeslice. >>> + >>> +DMA Bridge >>> +---------- >>> + >>> +The DMA Bridge is custom DMA engine that manages the flow of data >>> +in and out of workloads. AIC100 has one of these. The DMA Bridge has 16 >>> +channels, each consisting of a set of request/response FIFOs. Each active >>> +workload is assigned a single DMA Bridge channel. The DMA Bridge exposes >>> +hardware registers to manage the FIFOs (head/tail pointers), but requires >>> host >>> +memory to store the FIFOs. >>> + >>> +DDR >>> +--- >>> + >>> +AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR. >>> +This DDR is used to store workloads, data for the workloads, and is used >>> by the >>> +QSM for managing the device. NSPs are granted access to sections of the >>> DDR by >>> +the QSM. The host does not have direct access to the DDR, and must make >>> +requests to the QSM to transfer data to the DDR. >>> + >>> +MHI >>> +--- >>> + >>> +AIC100 has one MHI interface over PCIe. MHI itself is documented at >> >> Please exand MHI acronym. > > Its expanded about 40 lines up - "* MHI (Modem Host Interface)". I generally > go by the scheme of expanding an acronym the first time it is used in a > document, and then just using the acronym there after. > > Do you feel the expansion needs to be duplicated? It might help when this > section is moved to the top. > No, it's fine. >> >>> +Documentation/mhi/index.rst MHI is the mechanism the host uses to >>> communicate >>> +with the QSM. Except for workload data via the DMA Bridge, all >>> interaction with >>> +he device occurs via MHI. >> >> Typo in "he device". > > Doh. Will fix. > >> >>> +High-level Use Flow >>> +=================== >>> + >>> +AIC100 is a programmable accelerator typically used for running >>> +neural networks in inferencing mode to efficiently perform AI operations. >>> +AIC100 is not intended for training neural networks. AIC100 can be >>> utilitized >> >> utilitized -> utilized > > Sure > >> >>> +for generic compute workloads. >>> + >>> +Assuming a user wants to utilize AIC100, they would follow these steps: >>> + >>> +1. Compile the workload into an ELF targeting the NSP(s) >>> +2. Make requests to the QSM to load the workload and related artifacts >>> into the >>> + device DDR >>> +3. Make a request to the QSM to activate the workload onto a set of idle >>> NSPs >>> +4. 
Make requests to the DMA Bridge to send input data to the workload to be >>> + processed, and other requests to receive processed output data from the >>> + workload. >>> +5. Once the workload is no longer required, make a request to the QSM to >>> + deactivate the workload, thus putting the NSPs back into an idle state. >>> +6. Once the workload and related artifacts are no longer needed for future >>> + sessions, make requests to the QSM to unload the data from DDR. This >>> frees >>> + the DDR to be used by other users. >>> + >> >> Please specify if this is a single or multi-user device. > > It is multi-user. I will find a way to clarify that. > >> >>> +Boot Flow >>> +========= >>> + >>> +AIC100 uses a flashless boot flow, derived from Qualcomm MSMs. >> >> What's MSM? > > "Mobile Station Modem". It is a Qualcomm term from the 80s, and is used to > describe the "family" of Qualcomm phone SoCs. It used to be that the model > number would be "MSM8660" or "MSM8960", etc. That has changed a bit in the > past few years, but the products are still referred to as "MSMs". > > It is a common term in the Qualcomm world, and "MSM" is better known these > days than what it stands for. I don't think expanding it is going to add > value. > OK >> >>> +When AIC100 is first powered on, it begins executing PBL (Primary >>> Bootloader) >>> +from ROM. PBL enumerates the PCIe link, and initializes the BHI (Boot Host >>> +Interface) component of MHI. >>> + >>> +Using BHI, the host points PBL to the location of the SBL (Secondary >>> Bootloader) >>> +image. The PBL pulls the image from the host, validates it, and begins >>> +execution of SBL. >>> + >>> +SBL initializes MHI, and uses MHI to notify the host that the device has >>> entered >>> +the SBL stage. SBL performs a number of operations: >>> + >>> +* SBL initializes the majority of hardware (anything PBL left >>> uninitialized), >>> + including DDR. >>> +* SBL offloads the bootlog to the host. >>> +* SBL synchonizes timestamps with the host for future logging. >> >> synchonizes -> synchronizes > > Yep > >> >>> +* SBL uses the Sahara protocol to obtain the runtime firmware images from >>> the >>> + host. >>> + >>> +Once SBL has obtained and validated the runtime firmware, it brings the >>> NSPs out >>> +of reset, and jumps into the QSM. >>> + >>> +The QSM uses MHI to notify the host that the device has entered the QSM >>> stage >>> +(AMSS in MHI terms). At this point, the AIC100 device is fully >>> functional, and >>> +ready to process workloads. >>> + >>> +Userspace components >>> +==================== >>> + >>> +Compiler >>> +-------- >>> + >>> +An open compiler for AIC100 based on upstream LLVM can be found at: >>> +https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc >>> + >>> +Usermode Driver (UMD) >>> +--------------------- >>> + >>> +An open UMD that interfaces with the qaic kernel driver can be found at: >>> +https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100 >> >> This repo is empty. > > Correct. That was mentioned in the cover letter. Targeting to post content > this week. Just working through the last steps of our internal process. > >> >>> + >>> +Sahara loader >>> +------------- >>> + >>> +An open implementation of the Sahara protocol called kickstart can be >>> found at: >>> +https://github.com/andersson/qdl >>> + >>> +MHI Channels >>> +============ >>> + >>> +AIC100 defines a number of MHI channels for different purposes. This is a >>> list >>> +of the defined channels, and their uses.
>>> + >>> +| QAIC_LOOPBACK >>> +| Channels 0/1 >> >> I would use a comma or & here. > > Interesting. I can see that. I think I like "&". Will do that. > >> >>> +| Valid for AMSS >>> +| Any data sent to the device on this channel is sent back to the host. >>> + >>> +| QAIC_SAHARA >>> +| Channels 2/3 >>> +| Valid for SBL >>> +| Used by SBL to obtain the runtime firmware from the host. >>> + >>> +| QAIC_DIAG >>> +| Channels 4/5 >>> +| Valid for AMSS >>> +| Used to communicate with QSM via the Diag protocol. >>> + >>> +| QAIC_SSR >>> +| Channels 6/7 >>> +| Valid for AMSS >>> +| Used to notify the host of subsystem restart events, and to offload SSR >>> crashdumps. >>> + >>> +| QAIC_QDSS >>> +| Channels 8/9 >>> +| Valid for AMSS >>> +| Used for the Qualcomm Debug Subsystem. >>> + >>> +| QAIC_CONTROL >>> +| Channels 10/11 >>> +| Valid for AMSS >>> +| Used for the Neural Network Control (NNC) protocol. This is the primary >>> channel between host and QSM for managing workloads. >>> + >>> +| QAIC_LOGGING >>> +| Channels 12/13 >>> +| Valid for SBL >>> +| Used by the SBL to send the bootlog to the host. >>> + >>> +| QAIC_STATUS >>> +| Channels 14/15 >>> +| Valid for AMSS >>> +| Used to notify the host of Reliability, Accessability, Serviceability >>> (RAS) events. >> >> Accessability -> Accessibility > > Got it. > >> >>> +| QAIC_TELEMETRY >>> +| Channels 16/17 >>> +| Valid for AMSS >>> +| Used to get/set power/thermal/etc attributes. >>> + >>> +| QAIC_DEBUG >>> +| Channels 18/19 >>> +| Valid for AMSS >>> +| Not used. >>> + >>> +| QAIC_TIMESYNC >>> +| Channels 20/21 >>> +| Valid for SBL/AMSS >>> +| Used to synchronize timestamps in the device side logs with the host >>> time source. >>> + >>> +DMA Bridge >>> +========== >>> + >>> +Overview >>> +-------- >>> + >>> +The DMA Bridge is one of the main interfaces to the host from the device >>> +(the other being MHI). As part of activating a workload to run on NSPs, >>> the QSM >>> +assigns that network a DMA Bridge channel. A workload's DMA Bridge channel >>> +(DBC for short) is solely for the use of that workload and is not shared >>> with >>> +other workloads. >>> + >>> +Each DBC is a pair of FIFOs that manage data in and out of the workload. >>> One >>> +FIFO is the request FIFO. The other FIFO is the response FIFO. >>> + >>> +Each DBC contains 4 registers in hardware: >>> + >>> +* Request FIFO head pointer (offset 0x0). Read only to the host. >>> Indicates the >> >> Read only _by_ the host. > > Sure. > >> >>> + latest item in the FIFO the device has consumed. >>> +* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host >>> + increments this register to add new items to the FIFO. >>> +* Response FIFO head pointer (offset 0x8). Read/write by the host. >>> Indicates >>> + the latest item in the FIFO the host has consumed. >>> +* Response FIFO tail pointer (offset 0xc). Read only to the host. Device >> >> Read only _by_ the host. > > Sure. > >> >>> + increments this register to add new items to the FIFO. >>> + >>> +The values in each register are indexes in the FIFO. To get the location >>> of the >>> +FIFO element pointed to by the register: FIFO base address + register * >>> element >>> +size. >>> + >>> +DBC registers are exposed to the host via the second BAR. Each DBC >>> consumes >>> +0x1000 of space in the BAR. >> >> I wouldn't use hex for the sizes. 4KB seems a lot more readable. > > Good point. Will do. > >> >>> +The actual FIFOs are backed by host memory.
When sending a request to the >>> QSM >>> +to activate a network, the host must donate memory to be used for the >>> FIFOs. >>> +Due to internal mapping limitations of the device, a single contigious >>> chunk of >> >> contigious -> contiguous > > Got it. > >> >>> +memory must be provided per DBC, which hosts both FIFOs. The request FIFO >>> will >>> +consume the beginning of the memory chunk, and the response FIFO will >>> consume >>> +the end of the memory chunk. >>> + >>> +Request FIFO >>> +------------ >>> + >>> +A request FIFO element has the following structure: >>> + >>> +| { >>> +| u16 req_id; >>> +| u8 seq_id; >>> +| u8 pcie_dma_cmd; >>> +| u32 reserved; >>> +| u64 pcie_dma_source_addr; >>> +| u64 pcie_dma_dest_addr; >>> +| u32 pcie_dma_len; >>> +| u32 reserved; >>> +| u64 doorbell_addr; >>> +| u8 doorbell_attr; >>> +| u8 reserved; >>> +| u16 reserved; >>> +| u32 doorbell_data; >>> +| u32 sem_cmd0; >>> +| u32 sem_cmd1; >>> +| u32 sem_cmd2; >>> +| u32 sem_cmd3; >>> +| } >>> + >>> +Request field descriptions: >>> + >>> +| req_id- request ID. A request FIFO element and a response FIFO element >>> with >>> +| the same request ID refer to the same command. >>> + >>> +| seq_id- sequence ID within a request. Ignored by the DMA Bridge. >>> + >>> +| pcie_dma_cmd- describes the DMA element of this request. >>> +| Bit(7) is the force msi flag, which overrides the DMA Bridge MSI >>> logic >>> +| and generates a MSI when this request is complete, and QSM >>> +| configures the DMA Bridge to look at this bit. >>> +| Bits(6:5) are reserved. >>> +| Bit(4) is the completion code flag, and indicates that the DMA Bridge >>> +| shall generate a response FIFO element when this request is >>> +| complete. >>> +| Bit(3) indicates if this request is a linked list transfer(0) or a >>> bulk >>> +| transfer(1). >>> +| Bit(2) is reserved. >>> +| Bits(1:0) indicate the type of transfer. No transfer(0), to >>> device(1), >>> +| from device(2). Value 3 is illegal. >>> + >>> +| pcie_dma_source_addr- source address for a bulk transfer, or the address >>> of >>> +| the linked list. >>> + >>> +| pcie_dma_dest_addr- destination address for a bulk transfer. >>> + >>> +| pcie_dma_len- length of the bulk transfer. Note that the size of this >>> field >>> +| limits transfers to 4G in size. >>> + >>> +| doorbell_addr- address of the doorbell to ring when this request is >>> complete. >>> + >>> +| doorbell_attr- doorbell attributes. >>> +| Bit(7) indicates if a write to a doorbell is to occur. >>> +| Bits(6:2) are reserved. >>> +| Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit, >>> +| 1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address >>> +| must be naturally aligned to the specified length. >>> + >>> +| doorbell_data- data to write to the doorbell. Only the bits >>> corresponding to >>> +| the doorbell length are valid. >>> + >>> +| sem_cmdN- semaphore command. >>> +| Bit(31) indicates this semaphore command is enabled. >>> +| Bit(30) is the to-device DMA fence. Block this request until all >>> +| to-device DMA transfers are complete. >>> +| Bit(29) is the from-device DMA fence. Block this request until all >>> +| from-device DMA transfers are complete. >>> +| Bits(28:27) are reserved. >>> +| Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the >>> +| specified value. 2 is increment. 3 is decrement. 4 is wait >>> +| until the semaphore is equal to the specified value. 5 is wait >>> +| until the semaphore is greater or equal to the specified value. 
>>> +| 6 is "P", wait until semaphore is greater than 0, then >>> +| decrement by 1. 7 is reserved. >>> +| Bit(23) is reserved. >>> +| Bit(22) is the semaphore sync. 0 is post sync, which means that the >>> +| semaphore operation is done after the DMA transfer. 1 is >>> +| presync, which gates the DMA transfer. Only one presync is >>> +| allowed per request. >>> +| Bit(21) is reserved. >>> +| Bits(20:16) is the index of the semaphore to operate on. >>> +| Bits(15:12) are reserved. >>> +| Bits(11:0) are the semaphore value to use in operations. >> >> It seems to me like structure documentation > > Yes. It can be modeled that way. However the code comes later so it can't > be referenced here yet. I've got a todo to come back and clean that up once > this series is merged. > >> >>> +Overall, a request is processed in 4 steps: >>> + >>> +1. If specified, the presync semaphore condition must be true >>> +2. If enabled, the DMA transfer occurs >>> +3. If specified, the postsync semaphore conditions must be true >>> +4. If enabled, the doorbell is written >>> + >>> +By using the semaphores in conjunction with the workload running on the >>> NSPs, >>> +the data pipeline can be synchronized such that the host can queue multiple >>> +requests of data for the workload to process, but the DMA Bridge will only >>> copy >>> +the data into the memory of the workload when the workload is ready to >>> process >>> +the next input. >>> + >>> +Response FIFO >>> +------------- >>> + >>> +Once a request is fully processed, a response FIFO element is generated if >>> +specified in pcie_dma_cmd. The structure of a response FIFO element: >>> + >>> +| { >>> +| u16 req_id; >>> +| u16 completion_code; >>> +| } >>> + >>> +req_id- matches the req_id of the request that generated this element. >>> + >>> +completion_code- status of this request. 0 is success. non-zero is an >>> error. >>> + >>> +The DMA Bridge will generate a MSI to the host as a reaction to activity >>> in the >>> +response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm >>> mitigation >>> +algorithm, where it will only generate a MSI when the response FIFO >>> transitions >>> +from empty to non-empty (unless force MSI is enabled and triggered). In >>> +response to this MSI, the host is expected to drain the response FIFO, and >>> must >>> +take care to handle any race conditions between draining the FIFO, and the >>> +device inserting elements into the FIFO. >>> + >>> +Neural Network Control (NNC) Protocol >>> +===================================== >>> + >>> +The NNC protocol is how the host makes requests to the QSM to manage >>> workloads. >>> +It uses the QAIC_CONTROL MHI channel. >>> + >>> +Each NNC request is packaged into a message. Each message is a series of >>> +transactions. A passthrough type transaction can contain elements known as >>> +commands. >>> + >>> +QSM requires NNC messages be little endian encoded and the fields be >>> naturally >>> +aligned. Since there are 64-bit elements in some NNC messages, 64-bit >>> alignment >>> +must be maintained. >>> + >>> +A message contains a header and then a series of transactions. A message >>> may be >>> +at most 4K in size from QSM to the host. From the host to the QSM, a >>> message >>> +can be at most 64K (maximum size of a single MHI packet), but there is a >>> +continuation feature where message N+1 can be marked as a continuation of >>> +message N. This is used for exceedingly large DMA xfer transactions. 
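A side note on the DMA Bridge response FIFO described further up: a small host-side sketch might make the drain/race handling easier to picture. The following is purely illustrative - the element layout, the 0x8/0xc register offsets and the empty-to-non-empty MSI behaviour come from the text above, but the FIFO depth, the complete_request() helper and the exact head/tail ("empty means head == tail") convention are my assumptions, not the driver's actual code:

  #include <linux/io.h>
  #include <linux/types.h>
  #include <asm/byteorder.h>

  /* Response FIFO element layout, as described in the document above. */
  struct dbc_rsp {
          __le16 req_id;
          __le16 completion_code;
  };

  /* Hypothetical helper that retires the request identified by req_id. */
  void complete_request(u16 req_id, u16 completion_code);

  /*
   * Illustrative only. Treats head == tail as "empty" and head as the index
   * of the next element to consume; the hardware's exact off-by-one
   * convention is glossed over here.
   */
  static void drain_response_fifo(void __iomem *dbc_regs, struct dbc_rsp *fifo,
                                  u32 nelem)
  {
          u32 head = readl(dbc_regs + 0x8);       /* host-owned head pointer */
          u32 tail = readl(dbc_regs + 0xc);       /* device-owned tail pointer */

          do {
                  while (head != tail) {
                          struct dbc_rsp *rsp = &fifo[head];

                          complete_request(le16_to_cpu(rsp->req_id),
                                           le16_to_cpu(rsp->completion_code));
                          head = (head + 1) % nelem;
                  }

                  /* Publish the new head so the device sees the free space. */
                  writel(head, dbc_regs + 0x8);

                  /*
                   * Re-read the tail: elements inserted while the FIFO was
                   * already non-empty do not generate another MSI, so they
                   * must be picked up before returning.
                   */
                  tail = readl(dbc_regs + 0xc);
          } while (head != tail);
  }

The second tail read at the bottom is what closes the race the text mentions between draining the FIFO and the device inserting new elements.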
>>> + >>> +Transaction descriptions: >>> + >>> +passthrough- Allows userspace to send an opaque payload directly to the >>> QSM. >>> +This is used for NNC commands. Userspace is responsible for managing >>> +the QSM message requirements in the payload >>> + >>> +dma_xfer- DMA transfer. Describes an object that the QSM should DMA into >>> the >>> +device via address and size tuples. >>> + >>> +activate- Activate a workload onto NSPs. The host must provide memory to >>> be >>> +used by the DBC. >>> + >>> +deactivate- Deactivate an active workload and return the NSPs to idle. >>> + >>> +status- Query the QSM about it's NNC implementation. Returns the NNC >>> version, >>> +and if CRC is used. >>> + >>> +terminate- Release a user's resources. >>> + >>> +dma_xfer_cont- Continuation of a previous DMA transfer. If a DMA transfer >>> +cannot be specified in a single message (highly fragmented), this >>> +transaction can be used to specify more ranges. >>> + >>> +validate_partition- Query to QSM to determine if a partition identifier is >>> +valid. >>> + >>> +Each message is tagged with a user id, and a partition id. The user id >>> allows >>> +QSM to track resources, and release them when the user goes away (eg the >>> process >>> +crashes). A partition id identifies the resource partition that QSM >>> manages, >>> +which this message applies to. >>> + >>> +Messages may have CRCs. Messages should have CRCs applied until the QSM >>> +reports via the status transaction that CRCs are not needed. The QSM on >>> the >>> +SA9000P requires CRCs for black channel safing. >>> + >>> +Subsystem Restart (SSR) >>> +======================= >>> + >>> +SSR is the concept of limiting the impact of an error. An AIC100 device >>> may >>> +have multiple users, each with their own workload running. If the >>> workload of >>> +one user crashes, the fallout of that should be limited to that workload >>> and not >>> +impact other workloads. SSR accomplishes this. >>> + >>> +If a particular workload crashes, QSM notifies the host via the QAIC_SSR >>> MHI >>> +channel. This notification identifies the workload by it's assigned DBC. >>> A >>> +multi-stage recovery process is then used to cleanup both sides, and get >>> the >>> +DBC/NSPs into a working state. >>> + >>> +When SSR occurs, any state in the workload is lost. Any inputs that were >>> in >>> +process, or queued by not yet serviced, are lost. The loaded artifacts >>> will >>> +remain in on-card DDR, but the host will need to re-activate the workload >>> if >>> +it desires to recover the workload. >>> + >>> +Reliability, Accessability, Serviceability (RAS) >> >> Accessability -> Accessibility > > Got it. > >> >>> +================================================ >>> + >>> +AIC100 is expected to be deployed in server systems where RAS ideology is >>> +applied. Simply put, RAS is the concept of detecting, classifying, and >>> +reporting errors. While PCIe has AER (Advanced Error Reporting) which >>> factors >>> +into RAS, AER does not allow for a device to report details about internal >>> +errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS >>> event >>> +occurs, QSM will report the event with appropriate details via the >>> QAIC_STATUS >>> +MHI channel. A sysadmin may determine that a particular device needs >>> +additional service based on RAS reports. >>> + >>> +Telemetry >>> +========= >>> + >>> +QSM has the ability to report various physical attributes of the device, >>> and in >>> +some cases, to allow the host to control them. 
Examples include thermal >>> limits, >>> +thermal readings, and power readings. These items are communicated via the >>> +QAIC_TELEMETRY MHI channel >>> diff --git a/Documentation/accel/qaic/index.rst >>> b/Documentation/accel/qaic/index.rst >>> new file mode 100644 >>> index 0000000..ad19b88 >>> --- /dev/null >>> +++ b/Documentation/accel/qaic/index.rst >>> @@ -0,0 +1,13 @@ >>> +.. SPDX-License-Identifier: GPL-2.0-only >>> + >>> +===================================== >>> + accel/qaic Qualcomm Cloud AI driver >>> +===================================== >>> + >>> +The accel/qaic driver supports the Qualcomm Cloud AI machine learning >>> +accelerator cards. >>> + >>> +.. toctree:: >>> + >>> + qaic >>> + aic100 >>> diff --git a/Documentation/accel/qaic/qaic.rst >>> b/Documentation/accel/qaic/qaic.rst >>> new file mode 100644 >>> index 0000000..b0e7a5f >>> --- /dev/null >>> +++ b/Documentation/accel/qaic/qaic.rst >>> @@ -0,0 +1,169 @@ >>> +.. SPDX-License-Identifier: GPL-2.0-only >>> + >>> +============= >>> + QAIC driver >>> +============= >>> + >>> +The QAIC driver is the Kernel Mode Driver (KMD) for the AIC100 family of AI >>> +accelerator products. >>> + >>> +Interrupts >>> +========== >>> + >>> +While the AIC100 DMA Bridge hardware implements an IRQ storm mitigation >>> +mechanism, it is still possible for an IRQ storm to occur. A storm can >>> happen >>> +if the workload is particularly quick, and the host is responsive. If the >>> host >>> +can drain the response FIFO as quickly as the device can insert elements >>> into >>> +it, then the device will frequently transition the response FIFO from >>> empty to >>> +non-empty and generate MSIs at a rate equilivelent to the speed of the >> >> equilivelent -> equivalent > > Sure. > >> >>> +workload's ability to process inputs. The lprnet (license plate reader >>> network) >>> +workload is known to trigger this condition, and can generate in excess of >>> 100k >>> +MSIs per second. It has been observed that most systems cannot tolerate >>> this >>> +for long, and will crash due to some form of watchdog due to the overhead >>> of >>> +the interrupt controller interrupting the host CPU. >>> + >>> +To mitigate this issue, the QAIC driver implements specific IRQ handling. >>> When >>> +QAIC receives an IRQ, it disables that line. This prevents the interrupt >>> +controller from interrupting the CPU. Then AIC drains the FIFO. Once the >>> FIFO >>> +is drained, QAIC implements a "last chance" polling algorithm where QAIC >>> will >>> +sleep for a time to see if the workload will generate more activity. The >>> IRQ >>> +line remains disabled during this time. If no activity is detected, QAIC >>> exits >>> +polling mode and reenables the IRQ line. >>> + >>> +This mitigation in QAIC is very effective. The same lprnet usecase that >>> +generates 100k IRQs per second (per /proc/interrupts) is reduced to >>> roughly 64 >>> +IRQs over 5 minutes while keeping the host system stable, and having the >>> same >>> +workload throughput performance (within run to run noise variation). >>> + >>> + >>> +Neural Network Control (NNC) Protocol >>> +===================================== >>> + >>> +The implementation of NNC is split between the KMD (QAIC) and UMD. In >>> general >>> +QAIC understands how to encode/decode NNC wire protocol, and elements of >>> the >>> +protocol which require kernelspace knowledge to process (for example, >>> mapping >> >> kernelspace is missing a space :P >> >>> +host memory to device IOVAs). 
QAIC understands the structure of a >>> message, and >>> +all of the transactions. QAIC does not understand commands (the payload >>> of a >>> +passthrough transaction). >>> + >>> +QAIC handles and enforces the required little endianness and 64-bit >>> alignment, >>> +to the degree that it can. Since QAIC does not know the contents of a >>> +passthrough transaction, it relies on the UMD to saitsfy the requirements. >> >> saitsfy -> satisfy > > Will do. > >> >>> +The terminate transaction is of particular use to QAIC. QAIC is not aware >>> of >>> +the resources that are loaded onto a device since the majority of that >>> activity >>> +occurs within NNC commands. As a result, QAIC does not have the means to >>> +roll back userspace activity. To ensure that a userspace client's >>> resources >>> +are fully released in the case of a process crash, or a bug, QAIC uses the >>> +terminate command to let QSM know when a user has gone away, and the >>> resources >>> +can be released. >>> + >>> +QSM can report a version number of the NNC protocol it supports. This is >>> in the >>> +form of a Major number and a Minor number. >>> + >>> +Major number updates indicate changes to the NNC protocol which impact the >>> +message format, or transactions (impacts QAIC). >>> + >>> +Minor number updates indicate changes to the NNC protocol which impact the >>> +commands (does not impact QAIC). >>> + >>> +uAPI >>> +==== >>> + >>> +QAIC defines a number of driver specific IOCTLs as part of the userspace >>> API. >>> +This section describes those APIs. >>> + >>> +DRM_IOCTL_QAIC_MANAGE: >>> +This IOCTL allows userspace to send a NNC request to the QSM. The call >>> will >>> +block until a response is received, or the request has timed out. >>> + >>> +DRM_IOCTL_QAIC_CREATE_BO: >>> +This IOCTL allows userspace to allocate a buffer object (BO) which can >>> send or >>> +receive data from a workload. The call will return a GEM handle that >>> +represents the allocated buffer. The BO is not usable until it has been >>> sliced >>> +(see DRM_IOCTL_QAIC_ATTACH_SLICE_BO). >>> + >>> +DRM_IOCTL_QAIC_MMAP_BO: >>> +This IOCTL allows userspace to prepare an allocated BO to be mmap'd into >>> the >>> +userspace process. >>> + >>> +DRM_IOCTL_QAIC_ATTACH_SLICE_BO: >>> +This IOCTL allows userspace to slice a BO in preparation for sending the >>> BO to >>> +the device. Slicing is the operation of describing what portions of a BO >>> get >>> +sent where to a workload. This requires a set of DMA transfers for the DMA >>> +Bridge, and as such, locks the BO to a specific DBC. >>> + >>> +DRM_IOCTL_QAIC_EXECUTE_BO: >>> +This IOCTL allows userspace to submit a set of sliced BOs to the device. >>> The >>> +call is non-blocking. Success only indicates that the BOs have been queued >>> +to the device, but does not guarantee they have been executed. >>> + >>> +DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO: >>> +This IOCTL operates like DRM_IOCTL_QAIC_EXECUTE_BO, but it allows >>> userspace to >>> +shrink the BOs sent to the device for this specific call. If a BO >>> typically has >>> +N inputs, but only a subset of those is available, this IOCTL allows >>> userspace >>> +to indicate that only the first M bytes of the BO should be sent to the >>> device >>> +to minimize data transfer overhead. This IOCTL dynamically recomputes the >>> +slicing, and therefore has some processing overhead before the BOs can be >>> queued >>> +to the device. 
>>> + >>> +DRM_IOCTL_QAIC_WAIT_BO: >>> +This IOCTL allows userspace to determine when a particular BO has been >>> processed >>> +by the device. The call will block until either the BO has been processed >>> and >>> +can be re-queued to the device, or a timeout occurs. >>> + >>> +DRM_IOCTL_QAIC_PERF_STATS_BO: >>> +This IOCTL allows userspace to collect performance statistics on the most >>> +recent execution of a BO. This allows userspace to construct an end to end >>> +timeline of the BO processing for a performance analysis. >>> + >>> +DRM_IOCTL_QAIC_PART_DEV: >>> +This IOCTL allows userspace to request a duplicate "shadow device". This >>> extra >>> +accelN device is associated with a specific partition of resources on the >>> AIC100 >>> +device and can be used for limiting a process to some subset of resources. >>> + >>> +Userspace Client Isolation >>> +========================== >>> + >>> +AIC100 supports multiple clients. Multiple DBCs can be consumed by a >>> single >>> +client, and multiple clients can each consume one or more DBCs. Workloads >>> +may contain sensistive information therefore only the client that owns the >> >> sensistive -> sensitive > > Will do. > >> >>> +workload should be allowed to interface with the DBC. >>> + >>> +Clients are identified by the instance associated with their open(). A >>> client >>> +may only use memory they allocate, and DBCs that are assigned to their >>> +workloads. Attempts to access resources assigned to other clients will be >>> +rejected. >>> + >>> +Module parameters >>> +================= >>> + >>> +QAIC supports the following module parameters: >>> + >>> +**datapath_polling (bool)** >>> + >>> +Configures QAIC to use a polling thread for datapath events instead of >>> relying >>> +on the device interrupts. Useful for platforms with broken multiMSI. >>> Must be >>> +set at QAIC driver initialization. Default is 0 (off). >>> + >>> +**mhi_timeout (int)** >>> + >>> +Sets the timeout value for MHI operations in milliseconds (ms). Must be >>> set >>> +at the time the driver detects a device. Default is 2000 (2 seconds). >>> + >>> +**control_resp_timeout (int)** >>> + >>> +Sets the timeout value for QSM responses to NNC messages in seconds (s). >>> Must >>> +be set at the time the driver is sending a request to QSM. Default is 60 >>> (one >>> +minute). >>> + >>> +**wait_exec_default_timeout (int)** >>> + >>> +Sets the default timeout for the wait_exec ioctl in milliseconds (ms). >>> Must be >>> +set prior to the waic_exec ioctl call. A value specified in the ioctl call >>> +overrides this for that call. Default is 5000 (5 seconds). >>> + >>> +**datapath_poll_interval_us (int)** >>> + >>> +Sets the polling interval in microseconds (us) when datapath polling is >>> active. >>> +Takes effect at the next polling interval. Default is 100 (100 us). >> >> Cool that you are starting with the documentation :) >> I suggest running at least "checkpatch.pl --codespell" on the series as >> there are many spelling issues. > > That's what happened to checkpatch. Thanks for pointing out the flag. I > remember it doing spellcheck by default. Apparently it needs a flag now. > I'll do that going forward. > >> >> Regards, >> Jacek >> >> >
