Control: tags -1 + moreinfo

Hi Filippo,

On Wed, Nov 19, 2025 at 01:46:21PM +0000, Filippo Giunchedi wrote:
> Source: linux
> Version: 6.12.57-1
> Severity: important
> 
> Dear Maintainer,
> At Wikimedia Foundation we are running Trixie debian-installer on Dell r450
> hardware with an mpt3sas (HBA355i with id 1000:00e6) controller and SSD
> attached. While debian-installer finished successfully, grub was then unable
> to boot the installed system.
> 
> Partman is instructed to assemble a raid10 over four devices with LVM on top.
> Upon inspection the LVM PV is created with ~4GB metadata area which tricks
> grub into allocating the same amount of memory during LVM detection. While
> grub-install taking ~4GB of RAM "works" during installation, albeit
> grub-install being quite slow, it obviously fails when booting.
> 
> I tracked down the problem to md0 reporting optimal_io_size of ~4GB, and LVM
> defaults to align metadata with said size, resulting in abnormally large
> PV metadata area.
> 
> The large md0 optimal_io_size seems to come from component devices reporting
> 16MB optimal_io_size as shown below.
> 
> This host was working fine with Bookworm, which makes me think something has
> changed in mpt3sas.
> 
> My understanding is that the controller queries devices via block limits VPD
> page for these values, and I'm attaching the output below. The original task
> which spawned this work is https://phabricator.wikimedia.org/T407586
> 
> I'm happy to conduct further testing for bug fixes and/or investigation.
> 
> best,
> Filippo
> 
> ====
> 
> # uname -a
> Linux cloudcontrol2010-dev 6.12.57+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.57-1 (2025-11-05) x86_64 GNU/Linux
> 
> # lsblk -t
> NAME           ALIGNMENT MIN-IO     OPT-IO PHY-SEC LOG-SEC ROTA SCHED       RQ-SIZE      RA WSAME
> sda                    0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sda1                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sda2                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> `-sda3                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
>   `-md0                0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-swap         0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-root         0 524288 4293918720    4096     512    0                     4192256    0B
>     `-vg0-srv          0 524288 4293918720    4096     512    0                     4192256    0B
> sdb                    0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sdb1                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sdb2                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> `-sdb3                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
>   `-md0                0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-swap         0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-root         0 524288 4293918720    4096     512    0                     4192256    0B
>     `-vg0-srv          0 524288 4293918720    4096     512    0                     4192256    0B
> sdc                    0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sdc1                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sdc2                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> `-sdc3                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
>   `-md0                0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-swap         0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-root         0 524288 4293918720    4096     512    0                     4192256    0B
>     `-vg0-srv          0 524288 4293918720    4096     512    0                     4192256    0B
> sdd                    0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sdd1                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> |-sdd2                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
> `-sdd3                 0   4096   16773120    4096     512    0 mq-deadline     256   32760    0B
>   `-md0                0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-swap         0 524288 4293918720    4096     512    0                     4192256    0B
>     |-vg0-root         0 524288 4293918720    4096     512    0                     4192256    0B
>     `-vg0-srv          0 524288 4293918720    4096     512    0                     4192256    0B
> 
> 
> # pvck --dump headers /dev/md0
>   label_header at 512
>   label_header.id LABELONE
>   label_header.sector 1
>   label_header.crc 0xbdf3a961
>   label_header.offset 32
>   label_header.type LVM2 001
>   pv_header at 544
>   pv_header.pv_uuid KDkSuWsrIico15Y0PenxiLzT8Ad2dGLa
>   pv_header.device_size 1919546294272
>   pv_header.disk_locn[0] at 584 # location of data area
>   pv_header.disk_locn[0].offset 4293918720
>   pv_header.disk_locn[0].size 0
>   pv_header.disk_locn[1] at 600 # location list end
>   pv_header.disk_locn[1].offset 0
>   pv_header.disk_locn[1].size 0
>   pv_header.disk_locn[2] at 616 # location of metadata area
>   pv_header.disk_locn[2].offset 4096
>   pv_header.disk_locn[2].size 4293914624
>   pv_header.disk_locn[3] at 632 # location list end
>   pv_header.disk_locn[3].offset 0
>   pv_header.disk_locn[3].size 0
>   pv_header_extension at 648
>   pv_header_extension.version 2
>   pv_header_extension.flags 1
>   pv_header_extension.disk_locn[0] at 656 # location list end
>   pv_header_extension.disk_locn[0].offset 0
>   pv_header_extension.disk_locn[0].size 0
>   mda_header_1 at 4096 # metadata area
>   mda_header_1.checksum 0x84d8039
>   mda_header_1.magic 0x204c564d3220785b35412572304e2a3e
>   mda_header_1.version 1
>   mda_header_1.start 4096
>   mda_header_1.size 4293914624
>   mda_header_1.raw_locn[0] at 4136 # commit
>   mda_header_1.raw_locn[0].offset 4608
>   mda_header_1.raw_locn[0].size 1724
>   mda_header_1.raw_locn[0].checksum 0xdd78f68b
>   mda_header_1.raw_locn[0].flags 0x0
>   mda_header_1.raw_locn[1] at 4160 # precommit
>   mda_header_1.raw_locn[1].offset 0
>   mda_header_1.raw_locn[1].size 0
>   mda_header_1.raw_locn[1].checksum 0x0
>   mda_header_1.raw_locn[1].flags 0x0
>   metadata text at 8704 crc 0xdd78f68b # vgname vg0 seqno 4
> 
> # Devices are all reporting the same information
> 
> # sg_vpd -p bl /dev/sda
> Block limits VPD page (SBC)
>   Write same non-zero (WSNZ): 1
>   Maximum compare and write length: 0 blocks [command not implemented]
>   Optimal transfer length granularity: 0 blocks [not reported]
>   Maximum transfer length: 0 blocks [not reported]
>   Optimal transfer length: 0 blocks [not reported]
>   Maximum prefetch length: 0 blocks [not reported]
>   Maximum unmap LBA count: 0x3ffff
>   Maximum unmap block descriptor count: 0x20
>   Optimal unmap granularity: 0x1
>   Unmap granularity alignment valid: false
>   Maximum write same length: 0xffff
>   Maximum atomic transfer length: 0 blocks [not reported]
>   Atomic alignment: 0 blocks [unaligned atomic writes permitted]
>   Atomic transfer length granularity: 0 blocks [no granularity requirement]
>   Maximum atomic transfer length with atomic boundary: 0 blocks [not reported]
>   Maximum atomic boundary size: 0 blocks [can only write atomic 1 block]
> 
> # sg_vpd -p ai /dev/sda
> ATA information VPD page:
>   SAT Vendor identification: LSI
>   SAT Product identification: LSI SATL
>   SAT Product revision level: 0008
>   Device signature indicates SATA transport
>   Command code: 0xec
>   ATA command IDENTIFY DEVICE response summary:
>     model: MTFDDAK960TGA-1BC1ZABDA
>     serial number:         XXX
>     firmware revision:  D4DK003
> 
> c3:00.0 Serial Attached SCSI controller: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx
>         Subsystem: Dell HBA355i Front
>         Flags: bus master, fast devsel, latency 0, IRQ 16
>         Memory at e6800000 (64-bit, prefetchable) [size=1M]
>         Memory at e6900000 (64-bit, prefetchable) [size=1M]
>         Memory at e6a00000 (32-bit, non-prefetchable) [size=1M]
>         I/O ports at e000 [size=256]
>         Expansion ROM at <ignored> [disabled]
>         Capabilities: [40] Power Management version 3
>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>         Capabilities: [70] Express Endpoint, IntMsgNum 0
>         Capabilities: [b0] MSI-X: Enable+ Count=128 Masked-
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [148] Power Budgeting <?>
>         Capabilities: [158] Alternative Routing-ID Interpretation (ARI)
>         Capabilities: [168] Secondary PCI Express
>         Capabilities: [188] Physical Layer 16.0 GT/s <?>
>         Capabilities: [1b0] Lane Margining at the Receiver
>         Capabilities: [218] Dynamic Power Allocation <?>
>         Capabilities: [248] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
>         Capabilities: [348] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>         Capabilities: [380] Data Link Feature <?>
>         Kernel driver in use: mpt3sas

This sounds like quite an interesting finding, but it is probably hard
to reproduce without the hardware if it turns out to be specific to the
controller type and driver.
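
For what it is worth, the reported numbers are at least arithmetically
consistent with your analysis. The following is only a minimal sketch
under two assumptions of mine, neither verified on this hardware: that
stacked block devices combine optimal I/O sizes via the least common
multiple, and that raid10 over four devices with two near copies
proposes two chunks as its own optimal I/O size. The input values are
taken from your lsblk output.

```python
from math import lcm


def stacked_opt_io(component_opt_io: int, md_proposed_opt_io: int) -> int:
    """Combine two optimal I/O sizes the way a stacking layer might.

    Assumption: the combined value is the least common multiple of the
    two inputs. This mirrors my understanding of the block layer and is
    not a statement about what this particular kernel does.
    """
    return lcm(component_opt_io, md_proposed_opt_io)


# Values copied from the lsblk output in the report.
component_opt_io = 16773120   # OPT-IO of sda..sdd
md_chunk = 524288             # MIN-IO of md0 (512 KiB chunk)

# Assumption: 4 devices / 2 near copies -> 2 chunks per stripe.
md_proposed_opt_io = md_chunk * 2

stacked = stacked_opt_io(component_opt_io, md_proposed_opt_io)
print(stacked)
```

Under those assumptions the result is 4293918720, which matches the
OPT-IO shown for md0, so the component value of 16773120 would be
enough to explain the ~4GB figure.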

I would like to ask: do you have the possibility to set up an OS
installation where you can freely experiment with various kernels
and then assemble the arrays under them? If so, it would be great
if you could start bisecting to find where the behaviour changed.

I.e. install the OS independently of the controller, then manually
bisect the Debian kernel versions between bookworm and trixie
(6.1.y -> 6.12.y) to narrow down the upstream range.

Then bisect the upstream changes to find the offending commit. Let me
know if you need more specific instructions on the idea.
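
While bisecting, something like the following could be run under each
candidate kernel to record the values. It is only a sketch: it reads the
queue/optimal_io_size sysfs attribute, the device names are taken from
your lsblk output, and both the helper names and the 1 GiB threshold for
calling a kernel "bad" are arbitrary choices of mine.

```python
import platform
from pathlib import Path


def read_opt_io(devices, sysfs_root="/sys/block"):
    """Return {device: optimal_io_size in bytes, or None if unreadable}."""
    result = {}
    for dev in devices:
        path = Path(sysfs_root) / dev / "queue" / "optimal_io_size"
        try:
            result[dev] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Device missing or attribute not parseable on this kernel.
            result[dev] = None
    return result


def verdict(values, threshold=1 << 30):
    """Return "bad" if any device reports an optimal I/O size at or above
    the threshold (hypothetical cut-off of 1 GiB), otherwise "good"."""
    for size in values.values():
        if size is not None and size >= threshold:
            return "bad"
    return "good"


if __name__ == "__main__":
    # Device names as seen in the report; adjust to the test setup.
    values = read_opt_io(["sda", "sdb", "sdc", "sdd", "md0"])
    print(f"kernel: {platform.release()}")
    for dev, size in values.items():
        print(f"{dev}: {size}")
    print(f"verdict: {verdict(values)}")
```

The printed verdict can then be fed by hand into "git bisect good" or
"git bisect bad" for the kernel under test.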

Additionally, it would be interesting to know if the issue persists in
6.17.8 or even 6.18~rc6-1~exp1, so that we can clearly tell upstream
that the issue persists in newer kernels.

Ideally this goes to upstream asap, once we are more confident about
which subsystem to report the issue to. If we are reasonably confident
that it is mpt3sas specific already, then I would say to go already
to:

./scripts/get_maintainer.pl ./drivers/scsi/mpt3sas
Sathya Prakash <[email protected]> (maintainer:LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI))
Sreekanth Reddy <[email protected]> (maintainer:LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI))
Suganath Prabu Subramani <[email protected]> (maintainer:LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI))
"James E.J. Bottomley" <[email protected]> (maintainer:SCSI SUBSYSTEM)
"Martin K. Petersen" <[email protected]> (maintainer:SCSI SUBSYSTEM)
[email protected] (open list:LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI))
[email protected] (open list:LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI))
[email protected] (open list)

Do you concur?

Regards,
Salvatore
