RAID Controller Troubleshooting
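A few days ago I went through a RAID controller failure. Everything is back to normal now, and this is a recap of the whole incident.
One of our servers would not boot after a normal shutdown. During its self-test the RAID controller could not find any VD, and WebBIOS showed all six disks Offline, in the Good Config state. I was anxious but did not dare try anything rashly: in the cases other people have written up, only some of the disks go Offline and they show Bad Config, whereas every one of mine was Offline.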
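1. Replace the RAID controller
My first suspicion was the RAID controller itself. Swapping in another controller of the same model made no difference, and when that controller was put back into its original server, the disks there all showed Offline as well. So the RAID controller was not the problem.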
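2. Try Clear Configuration / Import Foreign Configuration
After going through everything Google could find in Chinese and English, I learned that the RAID configuration is written both to the RAID controller and to the disks, so as long as the disk order has not been disturbed there is a reasonable chance of recovery. Because WebBIOS was running in Safe Mode, it offered nothing beyond Clear Configuration. I wanted to keep the configuration stored on the disks and clear only the controller's copy, so I unplugged the cables between the disks and the RAID controller, ran Clear Configuration, and then ran Import Foreign Configuration. The import failed, and swapping the RAID controller again produced the same error.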
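3. Upgrade the RAID controller firmware
Since none of the above worked, a hardware fault in the RAID controller itself can be ruled out. I had also suspected something else: the server had crashed unexpectedly the week before, and I aborted the disk check on the following boot, which I thought might have corrupted the on-disk RAID configuration and made it unreadable. That can be ruled out too, because swapping the RAID controller and running Clear Configuration on another host produced exactly the same condition. This suggests that once the disks have gone Offline they cannot be imported again. Someone online mentioned an "old firmware stupid issue", but the firmware turned out to be at the latest version already.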
4. Back panel (backplane) quality problem
Swapping the RAID controller rules this out as well; it would be too much of a coincidence.
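5. Use a LiveCD to run StorCLI and inspect the configuration
This was getting painful: none of the solutions above worked. Since WebBIOS would only enter Safe Mode, could StorCLI turn Safe Mode off, so that I could then do a New Configuration without initialization?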
- Boot the Ubuntu Desktop LiveCD

      apt install -y openssh-server   # install sshd
      systemctl status ssh            # check sshd
      systemctl start ssh             # start sshd
      sudo passwd ubuntu              # set a password for the ubuntu user
- SSH into the server remotely and install StorCLI

      sudo -s   # switch to root
      cd 1.23.02_StorCLI/storcli_All_OS/storcli_All_OS/Ubuntu && dpkg -i storcli_1.23.02_all.deb
      ln -s /opt/MegaRAID/storcli/storcli64 /usr/local/bin/storcli   # create a symlink
- Show the event log

      storcli /c0 show events

      seqNum: 0x00000006
      Seconds since last reboot: 12
      Code: 0x0000021b
      Class: 2
      Locale: 0x20
      Event Description: Disabling writes to flash as the part has gone bad
      Event Data:
      ===========
      None

      Controller = 0
      Status = Success
      Description = None

      Events = GETEVENTS

      Controller Properties :
      =====================

      ------------------------------------
      Ctrl Status  Method          Value
      ------------------------------------
         0 Success handleSuboption Events

      FOREIGN CONFIGURATION :
      =====================

      ----------------------------------------
      DG EID:Slot Type   State     Size NoVDs
      ----------------------------------------
       0 -        RAID1  Frgn  931.0 GB     1
       1 -        RAID10 Frgn  7.276 TB     1
      ----------------------------------------

      NoVDs - Number of VDs in disk group|DG - Diskgroup
      Total foreign drive groups = 2

      Physical Drives = 6

      PD LIST :
      =======

      ----------------------------------------------------------------------------------
      EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                  Sp Type
      ----------------------------------------------------------------------------------
      252:0     1 UGood F  931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -
      252:1     0 UGood F  931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -
      252:4     3 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      252:5     2 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      252:6     4 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      252:7     5 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      ----------------------------------------------------------------------------------
- Show the RAID controller information

      [root@seafile-2 storcli]# storcli /c0 show
      Generating detailed summary of the adapter, it may take a while to complete.

      Controller = 0
      Status = Success
      Description = None

      Product Name = LSI MegaRAID SAS 9260-8i
      Serial Number = SV10513757
      SAS Address = 500605b00306efd0
      PCI Address = 00:06:00:00
      System Time = 08/17/2017 14:26:37
      Mfg. Date = 01/27/11
      Controller Time = 08/17/2017 06:26:37
      FW Package Build = 12.15.0-0239
      FW Version = 2.130.403-4660
      BIOS Version = 3.30.02.2_4.16.08.00_0x06060A05
      Driver Name = megaraid_sas
      Driver Version = 06.811.02.00-rh1
      Vendor Id = 0x1000
      Device Id = 0x79
      SubVendor Id = 0x1000
      SubDevice Id = 0x9261
      Host Interface = PCI-E
      Device Interface = SAS-6G
      Bus Number = 6
      Device Number = 0
      Function Number = 0
      Drive Groups = 2

      TOPOLOGY :
      ========

      ----------------------------------------------------------------------------
      DG Arr Row EID:Slot DID Type   State BT     Size PDC  PI SED DS3  FSpace TR
      ----------------------------------------------------------------------------
       0 -   -   -        -   RAID1  Optl  N  931.0 GB dflt N  N   none N      N
       0 0   -   -        -   RAID1  Optl  N  931.0 GB dflt N  N   none N      N
       0 0   0   252:0    1   DRIVE  Onln  N  931.0 GB dflt N  N   none -      N
       0 0   1   252:1    0   DRIVE  Onln  N  931.0 GB dflt N  N   none -      N
       1 -   -   -        -   RAID10 Optl  N  7.276 TB dflt N  N   none N      N
       1 0   -   -        -   RAID1  Optl  N  3.637 TB dflt N  N   none N      N
       1 0   0   252:4    3   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
       1 0   1   252:5    2   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
       1 1   -   -        -   RAID1  Optl  N  3.637 TB dflt N  N   none N      N
       1 1   0   252:6    4   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
       1 1   1   252:7    5   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
      ----------------------------------------------------------------------------

      DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
      DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Dgrd=Degraded
      Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
      PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
      DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
      TR=Transport Ready

      Virtual Drives = 2

      VD LIST :
      =======

      --------------------------------------------------------------
      DG/VD TYPE   State Access Consist Cache Cac sCC     Size Name
      --------------------------------------------------------------
      0/0   RAID1  Optl  RW     No      RWTD  -   ON  931.0 GB
      1/1   RAID10 Optl  RW     No      RWTD  -   ON  7.276 TB
      --------------------------------------------------------------

      Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
      Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
      Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
      AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
      Check Consistency

      Physical Drives = 6

      PD LIST :
      =======

      ----------------------------------------------------------------------------------
      EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                  Sp Type
      ----------------------------------------------------------------------------------
      252:0     1 Onln   0 931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -
      252:1     0 Onln   0 931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -
      252:4     3 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      252:5     2 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      252:6     4 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      252:7     5 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
      ----------------------------------------------------------------------------------

      EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
      DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
      UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
      Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
      SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
      UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
      CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
- Run the import; the result still reports safe mode

      root@ubuntu:~# storcli /c0/fall import
      Controller = 0
      Status = Failure
      Description = Controller is booted to safe mode. Command is not supported in this mode
- Safe mode looked like a deadlock. Where could this loop be unblocked? I found one setting that might untie the knot:

      root@ubuntu:~# storcli /c0 show bios
      Controller = 0
      Status = Success
      Description = None

      Controller Properties :
      =====================

      -----------------------------------------------------
      Ctrl_Prop                        Value
      -----------------------------------------------------
      Basic Input/Output System (BIOS) ON
      Auto Boot Select(ABS)            OFF
      BIOS Boot Mode                   Safe mode on errors
      Device Exposure                  Expose All
      -----------------------------------------------------

      Controller = 0
      Status = Success
      Description = None

      Controller Properties :
      =====================

      -------------------------------------------------
      Ctrl_Prop                        Value
      -------------------------------------------------
      Basic Input/Output System (BIOS) ON
      Auto Boot Select(ABS)            OFF
      BIOS Boot Mode                   Pause on errors
      Device Exposure                  Expose All
      -------------------------------------------------

      storcli /c0 set soe=off
      Controller = 0
      Status = Success
      Description = None

      Controller Properties :
      =====================

      -------------------------------------------------
      Ctrl_Prop                        Value
      -------------------------------------------------
      Basic Input/Output System (BIOS) ON
      Auto Boot Select(ABS)            OFF
      BIOS Boot Mode                   Pause on errors
      Device Exposure                  Expose All
      -------------------------------------------------
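- Reboot the server. The RAID controller prompts you to press "D" to suppress the message about the missing BBU; press "Y" to continue. At this point the disks had come Online automatically and the system booted normally. I then repaired the production host by following the same steps, and the data finally came back. The whole thing cost me 12+ hours.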
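The checks done by eye in the walkthrough (drive states, the safe-mode refusal, the BIOS boot mode) can be sketched as a few small shell helpers. This is only a sketch under assumptions: the function names are hypothetical, the helpers read StorCLI text output on stdin, and they rely on the column layout seen in the outputs quoted earlier, which may differ in other StorCLI versions.

```shell
#!/bin/sh
# Hypothetical helpers that parse StorCLI text output from stdin,
# so they can be tried without a controller attached.

# Count physical drives per state. PD LIST rows start with EID:Slt
# (for example "252:0") and carry the state in the third column.
pd_state_counts() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ { n[$3]++ }
         END { for (s in n) printf "%s=%d\n", s, n[s] }' | sort
}

# Succeed if the output contains the safe-mode refusal seen on import.
is_safe_mode_failure() {
    grep -qi 'booted to safe mode'
}

# Print the value of the first "BIOS Boot Mode" row of "show bios" output.
bios_boot_mode() {
    awk '/BIOS Boot Mode/ {
             sub(/^.*BIOS Boot Mode[ \t]+/, "")
             sub(/[ \t]+$/, "")
             print
             exit
         }'
}

# Sample lines copied from the outputs quoted earlier in this post.
pd_rows='252:0 1 UGood F 931.0 GB SATA HDD N N 512B WDC WD10EZEX-08WN4A0 U -
252:4 3 UGood F 3.637 TB SATA HDD N N 512B WDC WD4002FYYZ-01B7CB0 U -'
import_out='Status = Failure
Description = Controller is booted to safe mode. Command is not supported in this mode'
bios_out='BIOS Boot Mode                   Safe mode on errors'

printf '%s\n' "$pd_rows" | pd_state_counts
if printf '%s\n' "$import_out" | is_safe_mode_failure; then
    echo "import blocked: controller is in safe mode"
fi
printf '%s\n' "$bios_out" | bios_boot_mode
```

On a machine with the controller, the same helpers could be fed from the commands used in this post, for example `storcli /c0 show | pd_state_counts` or `storcli /c0 show bios | bios_boot_mode`.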
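Summary
- Test every operation on a machine without important data first, and only then apply it to a production server.
- Always keep backups on a different host; once every disk in the machine goes Offline, local backups are gone too.
- DIY servers are not reliable. If possible, fit the RAID controller with a BBU.
- If you must build a DIY server, single-disk data plus a single-disk backup is more reliable than a RAID controller.
- Prefer OS open standards over manufacturer standards; for RAID1/RAID10, software RAID is an option.
- If possible, build your own ZFS server or a distributed storage system such as Ceph.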
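For the software RAID option, a minimal sketch with Linux `mdadm` might look like the following. The helper name and the device names (`/dev/sdb` and so on) are hypothetical, and the function only prints the command instead of running it, because creating an array overwrites the member disks.

```shell
#!/bin/sh
# Print (do not run) an mdadm command that would create a software RAID array.
# Usage: mdadm_create_cmd LEVEL MD_DEVICE MEMBER...
mdadm_create_cmd() {
    level=$1
    md=$2
    shift 2
    # After the shift, $# is the number of member devices.
    echo "mdadm --create $md --level=$level --raid-devices=$# $*"
}

# Hypothetical device names, for illustration only.
mdadm_create_cmd 1 /dev/md0 /dev/sdb /dev/sdc
mdadm_create_cmd 10 /dev/md1 /dev/sdd /dev/sde /dev/sdf /dev/sdg
```

Printing the command first makes it easy to double-check the member devices before running anything destructive as root.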
References:
- [1] [LSI MegaRAID attached storage - WebBIOS - Can't import foreign configuration](https://community.spiceworks.com/topic/2011399-lsi-megaraid-attached-storage-webbios-can-t-import-foreign-configuration)
- [2] [Fix Raid on LSI controllers when a disk is shown as ubad](http://sudoall.com/fix-raid-on-lsi-controllers-when-a-disk-is-shown-as-ubad/)
- [3] [MegaRAID® SAS Software User Guide](http://pleiades.ucsc.edu/doc/lsi/MegaRAID_SAS_Software_User_Guide.pdf)
- [4] [Monitoring RAID drive health using StorCLI and MegaCLI plus other useful commands](http://tate.cx/monitoring-raid-drive-health-using-storcli-and-other-useful-commands/)
- [5] [StorCLI commands](https://www.ibm.com/support/knowledgecenter/en/TI0003K/p8eip/p8eip_drive_commands_storcli.htm)
- [6] [Using the LSI STORcli Utility](https://communities.cisco.com/docs/DOC-59532)