Just want to provide an update here. Installed Solaris 11.1 and reconfigured everything. Went back to the Emulex card since it is a dual port that can connect to both switches. Same problem; well, the link does not fail, but it is writing at 20k/s.
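For reference, throughput figures like the ones further down look like one-second zpool iostat samples; a minimal way to watch the pool's write rate while reproducing this (the pool name "tank" is a placeholder) is:

    # capacity, operations and bandwidth for one pool, sampled every second
    zpool iostat tank 1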
I am really not sure what to do anymore, other than to accept that FC target is no longer an option, but I will post in the Oracle Solaris forum. Either this has been an issue for some time, or it is a hardware combination, or perhaps I am doing something seriously wrong.

On Sat, Jun 8, 2013 at 6:57 PM, Heinrich van Riel <[email protected]> wrote:

> I took a look at every server that I knew I could power down or that is
> slated for removal in the future, and I found a QLogic adapter not in use.
>
> HBA Port WWN: 2100001b3280b
> Port Mode: Target
> Port ID: 12000
> OS Device Name: Not Applicable
> Manufacturer: QLogic Corp.
> Model: QLE2460
> Firmware Version: 5.2.1
> FCode/BIOS Version: N/A
> Serial Number: not available
> Driver Name: COMSTAR QLT
> Driver Version: 20100505-1.05
> Type: F-port
> State: online
> Supported Speeds: 1Gb 2Gb 4Gb
> Current Speed: 4Gb
> Node WWN: 2000001b3280b
>
> The link does not go down, but it is useless; right from the start it is as
> slow as the Emulex after I made the xfer change.
> So it is not a driver issue.
>
>  alloc   free   read  write   read  write
>  -----  -----  -----  -----  -----  -----
>   681G  53.8T      5     12  29.9K  51.3K
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0     88      0   221K
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0    163      0   812K
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0    198      0  1.13M
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0     88      0   221K
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0    187      0  1.02M
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>   681G  53.8T      0      0      0      0
>
> This is a clean install of a7 with nothing done other than nic config in
> lacp. I did not attempt a reinstall of a5 yet and probably won't either.
> I don't know what to do anymore. I was going to try OmniOS, but there is no
> way of knowing if it would work.
>
> I will see if I can get approval for a Solaris license for one year; if not,
> I am switching back to Windows Storage Spaces. I can't back up the current
> lab on the EMC array to this node in any event, since there is no IP
> connectivity and FC is a dream.
>
> Guess I am the only one trying to use it as an FC target, so these
> problems go unnoticed.
>
> On Sat, Jun 8, 2013 at 4:55 PM, Heinrich van Riel <[email protected]> wrote:
>
>> Changing max-xfer-size causes the link to stay up and no problems are
>> reported from stmf.
>>
>> # Memory_model    max-xfer-size
>> # ----------------------------------------
>> # Small           131072 - 339968
>> # Medium          339969 - 688128
>> # Large           688129 - 1388544
>> #
>> # Range: Min:131072 Max:1388544 Default:339968
>> #
>> max-xfer-size=339968;
>>
>> As soon as I changed it to 339969 there was no link loss, but I would not
>> be so lucky that it solves my problem: after a few minutes it grinds to a
>> crawl, so much so that in VMware it takes well over a minute just to
>> browse a folder; we are talking a few k/s.
>>
>> Setting it to the max causes the link to go down again, and stmf reports
>> the following again:
>> FROM STMF:0062568: abort_task_offline called for LPORT: lport abort timed out
>>
>> I also played around with the buffer settings.
>>
>> Any ideas?
>>
>> Thanks,
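For context on where that snippet lives: max-xfer-size is an emlxs driver property, so on a stock install the edit would normally go in the driver config file, followed by a reboot (or a forced re-read of the config) before a target-mode port picks it up. A minimal sketch, assuming the default /kernel/drv path:

    # /kernel/drv/emlxs.conf  -- assumed default location of the emlxs config
    # move from the "Small" to the "Medium" memory model
    max-xfer-size=339969;

    # then re-read the driver configuration; a full reboot is the safe option
    # for a port already bound as a COMSTAR target
    update_drv -f emlxs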
>>
>> On Fri, Jun 7, 2013 at 8:38 PM, Heinrich van Riel <[email protected]> wrote:
>>
>>> New card, different PCI-E slot (removed the other one), different FC
>>> switch (same model with same code), older HBA firmware (2.72a2) = same
>>> result.
>>>
>>> On the setting changes: when it boots it complains that this option does
>>> not exist: szfs_txg_synctime
>>> The changes still allowed for a constant write, but at a max of 100MB/s,
>>> so not much better than iscsi over 1GbE. I guess I would need to increase
>>> write_limit_override. If I disable the settings again it shows 240MB/s
>>> with bursts up to 300; both stats are from VMware's disk perf monitoring
>>> while cloning the same VM.
>>>
>>> All iSCSI luns remain active with no impact.
>>> So I will conclude, I guess, that this seems to be the problem that was
>>> there in 2009 from build ~100 to 128. When I search the error messages,
>>> all posts date back to 2009.
>>>
>>> I will try one more thing: a reinstall with 151a5, since a server that was
>>> removed from the env was running it with no issues, though with an older
>>> Emulex HBA, an LP10000 PCI-X.
>>> Looking at the notable changes in the release notes past a5, I do not see
>>> anything that changed that I would think would cause this behavior. Would
>>> this just be a waste of time?
>>>
>>> On Fri, Jun 7, 2013 at 6:36 PM, Heinrich van Riel <[email protected]> wrote:
>>>
>>>> In the debug info I see 1000's of the following events:
>>>>
>>>> FROM STMF:0149225: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149225: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149225: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149226: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149226: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149226: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149227: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149227: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149227: abort_task_offline called for LPORT: lport abort timed out
>>>> emlxs1:0149228: port state change from 11 to 11
>>>> FROM STMF:0149228: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149228: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149228: abort_task_offline called for LPORT: lport abort timed out
>>>> :0149228: fct_port_shutdown: port-ffffff1157ff1278, fct_process_logo: unable to
>>>> clean up I/O. iport-ffffff1157ff1378, icmd-ffffff1195463110
>>>> FROM STMF:0149229: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149229: abort_task_offline called for LPORT: lport abort timed out
>>>> FROM STMF:0149229: abort_task_offline called for LPORT: lport abort timed out
>>>>
>>>> And then the following as the port recovers.
>>>>
>>>> emlxs1:0150128: port state change from 11 to 11
>>>> emlxs1:0150128: port state change from 11 to 0
>>>> emlxs1:0150128: port state change from 0 to 11
>>>> emlxs1:0150128: port state change from 11 to 0
>>>> :0150850: fct_port_initialize: port-ffffff1157ff1278, emlxs initialize
>>>> emlxs1:0150950: port state change from 0 to e
>>>> emlxs1:0150953: Posting sol ELS 3 (PLOGI) rp_id=fffffd lp_id=22000
>>>> emlxs1:0150953: Processing sol ELS 3 (PLOGI) rp_id=fffffd
>>>> emlxs1:0150953: Sol ELS 3 (PLOGI) completed with status 0, did/fffffd
>>>> emlxs1:0150953: Posting sol ELS 62 (SCR) rp_id=fffffd lp_id=22000
>>>> emlxs1:0150953: Processing sol ELS 62 (SCR) rp_id=fffffd
>>>> emlxs1:0150953: Sol ELS 62 (SCR) completed with status 0, did/fffffd
>>>> emlxs1:0151053: Posting sol ELS 3 (PLOGI) rp_id=fffffc lp_id=22000
>>>> emlxs1:0151053: Processing sol ELS 3 (PLOGI) rp_id=fffffc
>>>> emlxs1:0151053: Sol ELS 3 (PLOGI) completed with status 0, did/fffffc
>>>> emlxs1:0151054: Posting unsol ELS 3 (PLOGI) rp_id=fffc02 lp_id=22000
>>>> emlxs1:0151054: Processing unsol ELS 3 (PLOGI) rp_id=fffc02
>>>> emlxs1:0151054: Posting unsol ELS 20 (PRLI) rp_id=fffc02 lp_id=22000
>>>> emlxs1:0151054: Processing unsol ELS 20 (PRLI) rp_id=fffc02
>>>> emlxs1:0151055: Posting unsol ELS 5 (LOGO) rp_id=fffc02 lp_id=22000
>>>> emlxs1:0151055: Processing unsol ELS 5 (LOGO) rp_id=fffc02
>>>> emlxs1:0151146: Posting unsol ELS 3 (PLOGI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 3 (PLOGI) rp_id=21500
>>>> emlxs1:0151146: Posting unsol ELS 20 (PRLI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 20 (PRLI) rp_id=21500
>>>> emlxs1:0151146: Posting unsol ELS 3 (PLOGI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 3 (PLOGI) rp_id=21600
>>>> emlxs1:0151146: Posting unsol ELS 20 (PRLI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151146: Processing unsol ELS 20 (PRLI) rp_id=21600
>>>> emlxs1:0151338: Posting unsol ELS 3 (PLOGI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 3 (PLOGI) rp_id=21500
>>>> emlxs1:0151338: Posting unsol ELS 20 (PRLI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 20 (PRLI) rp_id=21500
>>>> emlxs1:0151338: Posting unsol ELS 3 (PLOGI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 3 (PLOGI) rp_id=21600
>>>> emlxs1:0151338: Posting unsol ELS 20 (PRLI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151338: Processing unsol ELS 20 (PRLI) rp_id=21600
>>>> emlxs1:0151428: Posting unsol ELS 3 (PLOGI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 3 (PLOGI) rp_id=21500
>>>> emlxs1:0151428: port state change from e to 4
>>>> emlxs1:0151428: Posting unsol ELS 20 (PRLI) rp_id=21500 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 20 (PRLI) rp_id=21500
>>>> emlxs1:0151428: Posting unsol ELS 3 (PLOGI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 3 (PLOGI) rp_id=21600
>>>> emlxs1:0151428: Posting unsol ELS 20 (PRLI) rp_id=21600 lp_id=22000
>>>> emlxs1:0151428: Processing unsol ELS 20 (PRLI) rp_id=21600
>>>>
>>>> To be honest it does not really tell me much, since I do not understand
>>>> COMSTAR to these depths. It would appear that the link fails, so either a
>>>> driver problem or a hardware issue? I will replace the LPe11002 with a
>>>> brand new, unopened one and then give up on FC on OI.
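The "FROM STMF" and "emlxs1" lines above have the shape of COMSTAR's in-kernel trace ring rather than syslog output. If that is where this debug info came from, something like the following should dump it from a live system; the symbol name is taken from the illumos stmf source and is an assumption, so verify it exists before relying on it:

    # dump the COMSTAR/STMF trace ring buffer on a live system
    # ('stmf_trace_buf' is an assumed symbol from the illumos stmf module)
    echo '*stmf_trace_buf/s' | mdb -k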
>>>>
>>>> On Fri, Jun 7, 2013 at 4:54 PM, Heinrich van Riel <[email protected]> wrote:
>>>>
>>>>> I did find this in my inbox from 2009. I have been using FC with ZFS
>>>>> for quite some time and only recently retired an install with OI a5 that
>>>>> was upgraded from opensolaris. It did not do real heavy-duty stuff, but I
>>>>> had a similar problem where we were stuck on build 99 for quite some time.
>>>>>
>>>>> To Jean-Yves Chevallier@emulex:
>>>>> Any comments on the future of Emulex with regards to the COMSTAR project?
>>>>> It seems I am not the only one that has problems using Emulex in later
>>>>> builds. For now I am stuck with build 99.
>>>>> As always, any feedback would be greatly appreciated, since we have to
>>>>> make a decision to stick with OpenSolaris & COMSTAR or start migrating
>>>>> to another solution; we cannot stay on build 99 forever.
>>>>> What I am really trying to find out is if there is a roadmap/decision
>>>>> to ultimately only support QLogic HBAs in target mode.
>>>>>
>>>>> Response:
>>>>>
>>>>> Sorry for the delay in answering you. I do have news for you.
>>>>> First off, the interface used by COMSTAR has changed in recent Nevada
>>>>> releases (NV120 and up, I believe). Since it is not a public interface,
>>>>> we had no prior indication of this.
>>>>> We know of a number of issues, some in our driver, some in the COMSTAR
>>>>> stack. Based on the information we have from you and other community
>>>>> members, we have addressed all these issues in our next driver version;
>>>>> we will know for sure after we run our DVT (driver verification testing)
>>>>> next week. Depending on progress, this driver will be part of NV 128 or
>>>>> else NV 130.
>>>>> I believe it is worth taking another look based on these upcoming
>>>>> builds, which I imagine might also include fixes to the rest of the
>>>>> COMSTAR stack.
>>>>>
>>>>> Best regards.
>>>>>
>>>>> I can confirm that this was fixed in 128; all I did was update from 99
>>>>> to 128 and there were no problems.
>>>>> It seems like the same problem has now returned, and Emulex does not
>>>>> appear to be a good fit since Sun mostly used QLogic.
>>>>>
>>>>> Guess it is back to iscsi only for now.
>>>>>
>>>>> On Fri, Jun 7, 2013 at 4:40 PM, Heinrich van Riel <[email protected]> wrote:
>>>>>
>>>>>> I changed the settings. I do see it writing all the time now, but the
>>>>>> link still dies after a few min.
>>>>>>
>>>>>> Jun  7 16:30:57 emlxs: [ID 349649 kern.info] [ 5.0608]emlxs1: NOTICE: 730: Link reset. (Disabling link...)
>>>>>> Jun  7 16:30:57 emlxs: [ID 349649 kern.info] [ 5.0333]emlxs1: NOTICE: 710: Link down.
>>>>>> Jun  7 16:33:16 emlxs: [ID 349649 kern.info] [ 5.055D]emlxs1: NOTICE: 720: Link up. (4Gb, fabric, target)
>>>>>> Jun  7 16:33:16 fct: [ID 132490 kern.notice] NOTICE: emlxs1 LINK UP, portid 22000, topology Fabric Pt-to-Pt, speed 4G
>>>>>>
>>>>>> On Fri, Jun 7, 2013 at 3:06 PM, Jim Klimov <[email protected]> wrote:
>>>>>>
>>>>>>> Comment below
>>>>>>>
>>>>>>> On 2013-06-07 20:42, Heinrich van Riel wrote:
>>>>>>>
>>>>>>>> One sec apart, cloning a 150GB vm from a datastore on EMC to OI.
>>>>>>>>
>>>>>>>>  alloc   free   read  write   read  write
>>>>>>>>  -----  -----  -----  -----  -----  -----
>>>>>>>>   309G  54.2T     81     48   452K  1.34M
>>>>>>>>   309G  54.2T      0  8.17K      0   258M
>>>>>>>>   310G  54.2T      0  16.3K      0   510M
>>>>>>>>   310G  54.2T      0      0      0      0
>>>>>>>>   310G  54.2T      0      0      0      0
>>>>>>>>   310G  54.2T      0      0      0      0
>>>>>>>>   310G  54.2T      0  10.1K      0   320M
>>>>>>>>   311G  54.2T      0  26.1K      0   820M
>>>>>>>>   311G  54.2T      0      0      0      0
>>>>>>>>   311G  54.2T      0      0      0      0
>>>>>>>>   311G  54.2T      0      0      0      0
>>>>>>>>   311G  54.2T      0  10.6K      0   333M
>>>>>>>>   313G  54.2T      0  27.4K      0   860M
>>>>>>>>   313G  54.2T      0      0      0      0
>>>>>>>>   313G  54.2T      0      0      0      0
>>>>>>>>   313G  54.2T      0      0      0      0
>>>>>>>>   313G  54.2T      0  9.69K      0   305M
>>>>>>>>   314G  54.2T      0  10.8K      0   337M
>>>>>>>
>>>>>>> ...
>>>>>>> Were it not for your complaints about link resets and "unusable"
>>>>>>> connections, I'd say this looks like normal behavior for async
>>>>>>> writes: they get cached up, and every 5 sec you have a transaction
>>>>>>> group (TXG) sync which flushes the writes from cache to disks.
>>>>>>>
>>>>>>> In fact, the picture still looks like that, and possibly is the
>>>>>>> reason for hiccups.
>>>>>>>
>>>>>>> The TXG sync may be an IO-intensive process, which may block or
>>>>>>> delay many other system tasks; previously, when the interval
>>>>>>> defaulted to 30 sec, we got unusable SSH connections and temporarily
>>>>>>> "hung" disk requests on the storage server every half a minute when
>>>>>>> it was really busy (i.e. initial filling up with data from older
>>>>>>> boxes). It cached up about 10 seconds worth of writes, then spewed
>>>>>>> them out and could do nothing else. I don't think I ever saw network
>>>>>>> connections timing out or NICs reporting resets due to this, but I
>>>>>>> wouldn't be surprised if this were the cause in your case, though
>>>>>>> (i.e. disk IO threads preempting HBA/NIC threads for too long
>>>>>>> somehow, making the driver very puzzled about the staleness state
>>>>>>> of its card).
>>>>>>>
>>>>>>> At the very least, TXG syncs can be tuned by two knobs: the time
>>>>>>> limit (5 sec default) and the size limit (when the cache is "this"
>>>>>>> full, begin the sync to disk). The latter is a realistic figure that
>>>>>>> can allow you to sync in shorter bursts - with less interruption
>>>>>>> to smooth IO and process work.
>>>>>>>
>>>>>>> A somewhat related tunable is the number of requests that ZFS would
>>>>>>> queue up for a disk. Depending on its NCQ/TCQ abilities and random
>>>>>>> IO abilities (HDD vs. SSD), long or short queues may be preferable.
>>>>>>> See also: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
>>>>>>>
>>>>>>> These tunables can be set at runtime with "mdb -K", as well as in
>>>>>>> the /etc/system file to survive reboots. One of our storage boxes
>>>>>>> has these example values in /etc/system:
>>>>>>>
>>>>>>> *# default: flush txg every 5sec (may be max 30sec, optimize
>>>>>>> *# for 5 sec writing)
>>>>>>> set zfs:zfs_txg_synctime = 5
>>>>>>>
>>>>>>> *# Spool to disk when the ZFS cache is 0x18000000 (384Mb) full
>>>>>>> set zfs:zfs_write_limit_override = 0x18000000
>>>>>>> *# ...for realtime changes use mdb.
>>>>>>> *# Example sets 0x18000000 (384Mb, 402653184 b):
>>>>>>> *# echo zfs_write_limit_override/W0t402653184 | mdb -kw
>>>>>>>
>>>>>>> *# ZFS queue depth per disk
>>>>>>> set zfs:zfs_vdev_max_pending = 3
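Following the same mdb pattern as the write_limit example just above, the other two knobs can also be poked on a live system instead of waiting for a reboot. The values below are simply the ones from Jim's /etc/system lines, not recommendations, and note that on newer illumos-based builds the time knob may only exist as zfs_txg_synctime_ms (milliseconds), so check which symbol is present first:

    # runtime equivalents of the /etc/system settings above (illustrative values)
    echo zfs_txg_synctime/W0t5 | mdb -kw        # TXG sync target: 5 seconds
    echo zfs_vdev_max_pending/W0t3 | mdb -kw    # per-vdev queue depth: 3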
>>>>>>>
>>>>>>> HTH,
>>>>>>> //Jim Klimov

_______________________________________________
OpenIndiana-discuss mailing list
[email protected]
http://openindiana.org/mailman/listinfo/openindiana-discuss
