Raid Controller Troubleshooting

A few days ago I went through a RAID controller failure. Everything eventually returned to normal; here is how the whole incident unfolded.

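One of our servers failed to boot after a normal shutdown. The Raid Controller found no VD during its self-test, and WebBios showed all 6 disks Offline with a Good Config status. I was anxious but did not dare try anything rash: in the cases other people describe, only some of the disks go Offline and the status is Bad Config, whereas mine were all Offline.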

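1. Replace the Raid Controller

My first suspect was the Raid Controller itself. Swapping in a controller of the same model did not help, and after the controller was returned to its original server, the disks there all showed Offline as well. From this I concluded that the Raid Controller was not the problem.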

2. Try Clear Configuration / Import Foreign Configuration

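After searching Google through both Chinese and English material, I learned that the Raid Configuration is written to both the Raid Controller and the disks, so as long as the disk order has not been mixed up there is a fair chance of recovery. WebBios was running in Safe Mode, which blocks most operations and allows only Clear Configuration. To keep the Configuration stored on the disks, I planned to clear only the Raid Controller's copy first: I unplugged the cables between the disks and the Raid Controller, ran Clear Configuration, and then ran Import Foreign Configuration. The import failed, and swapping the Raid Controller again produced the same error.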
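The same foreign-configuration check can also be made from StorCLI, which is used later in this post. The sketch below is only an illustration under stated assumptions: the controller is `/c0`, the text comes from `storcli /c0/fall show all`, and the helper names `foreign_dg_count` and `foreign_drives` are hypothetical. They simply parse the table format shown in the StorCLI output further down.

```shell
# Hypothetical helpers; they only parse text on stdin and never touch the controller.

# Print the number reported on the "Total foreign drive groups = N" line.
foreign_dg_count() {
    sed -n 's/^[[:space:]]*Total foreign drive groups = \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print EID:Slt for each PD LIST row whose State is UGood and whose DG column is F (foreign).
foreign_drives() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ && $3 == "UGood" && $4 == "F" { print $1 }'
}

# Assumed usage on the affected host (needs root and an installed storcli):
#   storcli /c0/fall show all | foreign_dg_count
#   storcli /c0/fall show all | foreign_drives
```

If the count and the drive list match the arrays you expect, the disks still carry their configuration, which is the precondition for any import attempt.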

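3. Upgrade the Raid Controller firmware

With all of the above ineffective, a hardware fault in the Raid Controller itself can be ruled out. I had also suspected that an unexpected hang of this server the previous week, after which I aborted the disk check at boot, might have corrupted the Raid Configuration on the disks and made it unreadable; that can be ruled out too. Swapping the Raid Controller and running Clear Configuration on another host produced exactly the same state, which shows that once the disks have gone Offline they cannot be imported again. Someone online also mentioned an "old firmware stupid issue", but when I checked, the firmware was already at the latest version.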

4. Back panel (backplane) quality problem

Swapping the Raid Controller rules this out as well; it could hardly be such a coincidence.

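5. Use a LiveCD to run StorCLI and inspect the configuration

This was getting really frustrating: none of the solutions above worked. Since WebBios could only enter Safe Mode, could StorCLI turn off WebBios's Safe Mode, so that I could then do a New Configuration without init?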

  1. Boot the Ubuntu Desktop LiveCD

    sudo apt install -y openssh-server     # install sshd
    systemctl status ssh                   # check sshd
    sudo systemctl start ssh               # start sshd
    sudo passwd ubuntu                     # set a password for the ubuntu user
  2. SSH into the server remotely and install StorCLI

    sudo -s                                                              # switch to root
    cd 1.23.02_StorCLI/storcli_All_OS/storcli_All_OS/Ubuntu && dpkg -i storcli_1.23.02_all.deb
    ln -s /opt/MegaRAID/storcli/storcli64 /usr/local/bin/storcli         # create a symlink
  3. Show the event log

    storcli /c0 show events         
    
    seqNum: 0x00000006
    Seconds since last reboot: 12
    Code: 0x0000021b
    Class: 2
    Locale: 0x20
    Event Description: Disabling writes to flash as the part has gone bad
    Event Data:
    ===========
    None
    Controller = 0
    Status = Success
    Description = None
    
    Events = GETEVENTS
    
    Controller Properties :
    =====================
    
    ------------------------------------
    Ctrl Status  Method          Value  
    ------------------------------------
       0 Success handleSuboption Events 
       
       
    FOREIGN CONFIGURATION :
    =====================
    
    ----------------------------------------
    DG EID:Slot Type   State     Size NoVDs 
    ----------------------------------------
     0 -        RAID1  Frgn  931.0 GB     1 
     1 -        RAID10 Frgn  7.276 TB     1 
    ----------------------------------------
    
    NoVDs - Number of VDs in disk group|DG - Diskgroup
    Total foreign drive groups = 2
    Physical Drives = 6
    
    PD LIST :
    =======
    
    ----------------------------------------------------------------------------------
    EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                  Sp Type 
    ----------------------------------------------------------------------------------
    252:0     1 UGood F  931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -    
    252:1     0 UGood F  931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -    
    252:4     3 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -    
    252:5     2 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -    
    252:6     4 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -    
    252:7     5 UGood F  3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -    
    ----------------------------------------------------------------------------------
  4. Show the Raid Controller information

    [root@seafile-2 storcli]# storcli /c0 show
    Generating detailed summary of the adapter, it may take a while to complete.
    
    Controller = 0
    Status = Success
    Description = None
    
    Product Name = LSI MegaRAID SAS 9260-8i
    Serial Number = SV10513757
    SAS Address =  500605b00306efd0
    PCI Address = 00:06:00:00
    System Time = 08/17/2017 14:26:37
    Mfg. Date = 01/27/11
    Controller Time = 08/17/2017 06:26:37
    FW Package Build = 12.15.0-0239
    FW Version = 2.130.403-4660
    BIOS Version = 3.30.02.2_4.16.08.00_0x06060A05
    Driver Name = megaraid_sas
    Driver Version = 06.811.02.00-rh1
    Vendor Id = 0x1000
    Device Id = 0x79
    SubVendor Id = 0x1000
    SubDevice Id = 0x9261
    Host Interface = PCI-E
    Device Interface = SAS-6G
    Bus Number = 6
    Device Number = 0
    Function Number = 0
    Drive Groups = 2
    
    TOPOLOGY :
    ========
    
    ----------------------------------------------------------------------------
    DG Arr Row EID:Slot DID Type   State BT     Size PDC  PI SED DS3  FSpace TR
    ----------------------------------------------------------------------------
     0 -   -   -        -   RAID1  Optl  N  931.0 GB dflt N  N   none N      N
     0 0   -   -        -   RAID1  Optl  N  931.0 GB dflt N  N   none N      N
     0 0   0   252:0    1   DRIVE  Onln  N  931.0 GB dflt N  N   none -      N
     0 0   1   252:1    0   DRIVE  Onln  N  931.0 GB dflt N  N   none -      N
     1 -   -   -        -   RAID10 Optl  N  7.276 TB dflt N  N   none N      N
     1 0   -   -        -   RAID1  Optl  N  3.637 TB dflt N  N   none N      N
     1 0   0   252:4    3   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
     1 0   1   252:5    2   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
     1 1   -   -        -   RAID1  Optl  N  3.637 TB dflt N  N   none N      N
     1 1   0   252:6    4   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
     1 1   1   252:7    5   DRIVE  Onln  N  3.637 TB dflt N  N   none -      N
    ----------------------------------------------------------------------------
    
    DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
    DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Dgrd=Degraded
    Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
    PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
    DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
    TR=Transport Ready
    
    Virtual Drives = 2
    
    VD LIST :
    =======
    
    --------------------------------------------------------------
    DG/VD TYPE   State Access Consist Cache Cac sCC     Size Name
    --------------------------------------------------------------
    0/0   RAID1  Optl  RW     No      RWTD  -   ON  931.0 GB
    1/1   RAID10 Optl  RW     No      RWTD  -   ON  7.276 TB
    --------------------------------------------------------------
    
    Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
    Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
    Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
    AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
    Check Consistency
    
    Physical Drives = 6
    
    PD LIST :
    =======
    
    ----------------------------------------------------------------------------------
    EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                  Sp Type
    ----------------------------------------------------------------------------------
    252:0     1 Onln   0 931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -
    252:1     0 Onln   0 931.0 GB SATA HDD N   N  512B WDC WD10EZEX-08WN4A0   U  -
    252:4     3 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
    252:5     2 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
    252:6     4 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
    252:7     5 Onln   1 3.637 TB SATA HDD N   N  512B WDC WD4002FYYZ-01B7CB0 U  -
    ----------------------------------------------------------------------------------
    
    EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
    DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
    UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
    Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
    SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
    UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
    CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
  1. Run the import; the controller still reports safe mode

    root@ubuntu:~# storcli /c0/fall import
    Controller = 0
    Status = Failure
    Description = Controller is booted to safe mode. Command is not supported in this mode
    
  2. Safe mode looked like a deadlock: where could this loop be unblocked? I found one setting that might untie the knot

    root@ubuntu:~# storcli /c0 show bios
    Controller = 0
    Status = Success
    Description = None
    
    
    Controller Properties :
    =====================
    
    -----------------------------------------------------
    Ctrl_Prop                        Value               
    -----------------------------------------------------
    Basic Input/Output System (BIOS) ON                  
    Auto Boot Select(ABS)            OFF                 
    BIOS Boot Mode                   Safe mode on errors 
    Device Exposure                  Expose All          
    -----------------------------------------------------
    
    
    Controller = 0
    Status = Success
    Description = None
    
    
    Controller Properties :
    =====================
    
    -------------------------------------------------
    Ctrl_Prop                        Value           
    -------------------------------------------------
    Basic Input/Output System (BIOS) ON              
    Auto Boot Select(ABS)            OFF             
    BIOS Boot Mode                   Pause on errors 
    Device Exposure                  Expose All      
    -------------------------------------------------
    
    storcli /c0 set soe=off
    Controller = 0
    Status = Success
    Description = None
    
    
    Controller Properties :
    =====================
    
    -------------------------------------------------
    Ctrl_Prop                        Value           
    -------------------------------------------------
    Basic Input/Output System (BIOS) ON              
    Auto Boot Select(ABS)            OFF             
    BIOS Boot Mode                   Pause on errors 
    Device Exposure                  Expose All      
    -------------------------------------------------
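  3. Reboot the server. The Raid Controller prompts you to press "D" to dismiss the message about no BBU being installed, then "Y" to continue. At this point the disks came Online automatically and the system booted normally. I then repaired the production host by following the same steps, and the data was finally back. It cost me 12+ hours.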
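The turning point was the controller's BIOS Boot Mode, so it is worth checking that value first and confirming after the reboot that every drive is back Online. The sketch below assumes controller `/c0`; the helper names `bios_boot_mode` and `drives_not_online` are hypothetical and only parse the text formats shown in the output above. The log above does not show the command that switched the mode from "Safe mode on errors" to "Pause on errors"; the StorCLI reference guide documents a `set bios mode=<SOE|PE|IE|SME>` option, which is mentioned here as an assumption rather than as what was actually typed.

```shell
# Hypothetical helpers; they only parse text on stdin and never touch the controller.

# Print the value of the "BIOS Boot Mode" row from `storcli /c0 show bios` output.
bios_boot_mode() {
    sed -n 's/^[[:space:]]*BIOS Boot Mode[[:space:]]*//p' | sed 's/[[:space:]]*$//' | head -n 1
}

# Print EID:Slt of every PD LIST row whose State column is not Onln.
# Rows are recognised by the "enclosure:slot" pattern in the first column,
# so TOPOLOGY rows (which start with a DG index) are skipped.
drives_not_online() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ && $3 != "Onln" { print $1 }'
}

# Assumed usage on the affected host (needs root and an installed storcli):
#   storcli /c0 show bios | bios_boot_mode
#   storcli /c0 show | drives_not_online
# Assumption, not taken from the log above: the mode may be changed with
#   storcli /c0 set bios mode=PE
```

An empty result from `drives_not_online` after the reboot corresponds to the healthy state shown in step 4, where all six drives report Onln.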

Summary

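  1. Test operations on a machine that holds no important data first, and only then apply them to production servers.

  2. Always back up data to a different host; once every disk on the local machine goes offline, a local backup is useless.

  3. DIY servers are unreliable; if possible, fit the Raid Controller with a BBU.

  4. If you must DIY a server, single-disk data plus single-disk backup is more reliable than using a Raid Controller.

  5. Prefer OS open standards over manufacturer standards; for Raid1/Raid10, software Raid is an option.

  6. If possible, build your own ZFS server or a distributed storage system such as Ceph.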

Tao

2017-08-16 20:00 -0400