
Replacing a Hard Drive with MegaCli

tags: c4lab

Intro

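Our server's storage is built on MegaRAID,
so there are two ways to inspect the disks:

  1. Shut down and look in the storage controller's BIOS (this is what we always did before; the BIOS is entered by pressing CTRL + R)
  2. megacli + smartctl (what this note covers)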

Concept Overview: all the related terminology is covered here

Preparation

In short, you have to download MegaCli yourself; it is not available from yum or apt.

MegaCli

MegaCli download Site:
https://www.broadcom.com/support/download-search?pg=&pf=&pn=&pa=&po=&dk=megacli&pl=

# download
wget https://docs.broadcom.com/docs-and-downloads/raid-controllers/raid-controllers-common-files/8-07-14_MegaCLI.zip
# Unzip
unzip 8-07-14_MegaCLI.zip
cd Linux
# install
sudo yum localinstall MegaCli-8.07.14-1.noarch.rpm

Manual of MegaCli command line
https://www.alteeve.com/w/MegaCli64_Cheat_Sheet

Output

List all the HDD

[linnil1@lncrna MegaCli]$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll 
Enclosure Device ID: 8
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A 
Device Id: 0
WWN: 50014xxxxxxxxxx
Sequence Number: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 5.458 TB [0x2baa0f4b0 Sectors]
Non Coerced Size: 5.457 TB [0x2ba90f4b0 Sectors]
Coerced Size: 5.457 TB [0x2ba900000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: 0A82
Shield Counter: 0
Successful diagnostics completion on :  N/A 
SAS Address(0): 0x500304xxxxxxxxxx
Inquiry Data:      WD-WX31D95Hxxxxxxx WD60EFRX-xxxxxxx                    82.00A82
Device Speed: 6.0Gb/s 
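
The -PDList output is just a long run of "key: value" lines, one block per disk, so it is easy to turn into records for scripting. Below is a minimal sketch, assuming each disk's block starts with the "Enclosure Device ID" field (as it does in the listings in this note). The function name `parse_pd_list` and the shortened sample text are only illustrative; real output has many more fields.

```python
# Shortened sample built from two listings in this note; real output has more fields.
SAMPLE = """\
Enclosure Device ID: 8
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Device Id: 0
Firmware state: Online, Spun Up

Enclosure Device ID: 8
Slot Number: 14
Device Id: 15
Firmware state: Unconfigured(good), Spun Up
"""


def parse_pd_list(text):
    """Split -PDList style output into one dict per physical disk."""
    disks = []
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and anything that is not "key: value"
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            disks.append({})  # assumption: this field opens each disk's block
        if disks:
            disks[-1][key] = value
    return disks


disks = parse_pd_list(SAMPLE)
for d in disks:
    print(d["Enclosure Device ID"], d["Slot Number"], d["Device Id"], d["Firmware state"])
```

The enclosure ID and slot number are what the later -PhysDrv[E:S] arguments need, while the Device Id is what smartctl needs.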

List all Virtual drives

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -ldinfo -lALL -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)                                     
Name                :raid6vd01                                      
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 21.829 TB
Sector Size         : 512      
Parity Size         : 7.276 TB          
State               : Optimal
Strip Size          : 128 KB
Number Of Drives    : 8                                                                                                                               
Span Depth          : 1                                                                                                                               
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write   
Current Access Policy: Read/Write   
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No       
Is VD Cached: Yes          
Cache Cade Type : Read Only

For the disk you want to replace, you can read its serial number like this. Replace the 15 below with the disk's Device Id.

[linnil1@lncrna MegaCli]$ sudo smartctl -d megaraid,15 -a /dev/sda

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-xxxxxx
Serial Number:    Zxxxxx
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
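
If you need to check several disks, the command line can be generated from the Device Id. This is a rough sketch: it only builds the argument list shown above and pulls the serial number out of the text. The helper names `smartctl_cmd` and `serial_number` are made up for this note, and /dev/sda is simply the block device used in the example above.

```python
import re


def smartctl_cmd(device_id, block_device="/dev/sda"):
    """Argument list for querying one disk behind the MegaRAID controller."""
    return ["sudo", "smartctl", "-d", f"megaraid,{device_id}", "-a", block_device]


# Sample text copied from the output above.
SAMPLE = """\
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-xxxxxx
Serial Number:    Zxxxxx
"""


def serial_number(text):
    """Return the value of the 'Serial Number' line, or None if it is absent."""
    m = re.search(r"^Serial Number:\s*(\S+)", text, re.MULTILINE)
    return m.group(1) if m else None


print(" ".join(smartctl_cmd(15)))
print(serial_number(SAMPLE))
```

To actually run it, the list can be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting stdout fed to `serial_number`.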

Reference

Identify Bad Disk

A failed disk puts the raid1/raid5/raid6 virtual drive into a degraded state.

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -AdpAllInfo -aALl

                Device Present              
                ================            
Virtual Drives    : 3                       
  Degraded        : 1                                                   
  Offline         : 0                       
Physical Devices  : 24                      
  Disks           : 24                       
  Critical Disks  : 0                       
  Failed Disks    : 1                        
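
For a cron-style health check, the "Device Present" block can be reduced to a handful of counters. A minimal sketch follows, assuming the block looks like the one above. `parse_device_present` and `needs_attention` are hypothetical helper names, and which counters count as "bad" is my own choice, not something MegaCli defines.

```python
# Sample text copied from the "Device Present" block above.
SAMPLE = """\
Virtual Drives    : 3
  Degraded        : 1
  Offline         : 0
Physical Devices  : 24
  Disks           : 24
  Critical Disks  : 0
  Failed Disks    : 1
"""


def parse_device_present(text):
    """Collect every 'name : integer' line into a dict of counters."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counts[key.strip()] = int(value)
    return counts


def needs_attention(counts):
    """True if any counter that should normally be zero is non-zero."""
    watched = ("Degraded", "Offline", "Critical Disks", "Failed Disks")
    return any(counts.get(name, 0) > 0 for name in watched)


counts = parse_device_present(SAMPLE)
print(counts, needs_attention(counts))
```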

Find out which disk failed

(base) [linnil1@exon MegaCli]$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll  | egrep "(Arm)|(Device Id)|(Error)|(state)"
Device Id: 13                                  
Media Error Count: 0                           
Other Error Count: 0                           
Firmware state: Online, Spun Up                
Drive's position: DiskGroup: 1, Span: 0, Arm: 6
Device Id: 15                                  
Media Error Count: 495                         
Other Error Count: 3                           
Firmware state: Failed     
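
With 24 disks the filtered list is still long, so it can help to pick out the suspicious entries automatically. The sketch below groups the lines by "Device Id" and reports disks whose firmware state is Failed or whose media error count is non-zero. Those two criteria are an assumption on my part, and `suspect_disks` is a made-up name. Note that in the real output the "Drive's position" line comes before "Device Id", so the position line printed just above Device Id 15 belongs to that disk; the sketch simply ignores position lines.

```python
# Sample text copied from the filtered listing above.
SAMPLE = """\
Device Id: 13
Media Error Count: 0
Other Error Count: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 1, Span: 0, Arm: 6
Device Id: 15
Media Error Count: 495
Other Error Count: 3
Firmware state: Failed
"""


def suspect_disks(text):
    """Return the records of disks that look failed or are reporting media errors."""
    disks, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Device Id":
            current = {"Device Id": int(value)}
            disks.append(current)
        elif current is not None and key in ("Media Error Count", "Other Error Count"):
            current[key] = int(value)
        elif current is not None and key == "Firmware state":
            current[key] = value
    return [
        d for d in disks
        if d.get("Firmware state", "").startswith("Failed")
        or d.get("Media Error Count", 0) > 0
    ]


for d in suspect_disks(SAMPLE):
    print(d)
```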

Of course, if the disk is really dead, you may not even be able to connect to it.

(base) [linnil1@exon ~]$ sudo smartctl -d sat+megaraid,15 -a /dev/sda

smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.31.1.el6.x86_64] (local build)                         
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net 

Smartctl: Device Read Identity Failed: megasas_cmd result: 0.15 = 0/46

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Make Sure Hot-plug for Motherboard

Check the documentation for your motherboard, or for the model of the machine as sold,
and make sure the HDDs can be hot-swapped.

Mother Board PDF: https://www.supermicro.com/manuals/motherboard/C606_602/MNL-1258.pdf

Remove it

You need to use megacli to mark the bad disk as removable.

See https://www.advancedclustering.com/act_kb/replacing-a-disk-with-megacli/

(To be completed)

The parameter is -physdrv[<enclosure_ID>:<slot_id>], e.g. -physdrv[8:14]

Be sure to double-check the disk before removing it

sudo ./MegaCli64 -pdInfo -PhysDrv[8:14] -a0

Then remove it

MegaCli64 -pdoffline       -physdrv[8:14] -a0
MegaCli64 -pdmarkmissing   -physdrv[8:14] -a0
MegaCli64 -pdprprmv        -physdrv[8:14] -a0
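
Since a typo in the enclosure or slot number here would take the wrong disk offline, one option is to generate the three commands from the two numbers and review them before running anything. This is only a sketch: `physdrv` and `removal_commands` are hypothetical helpers, the enclosure and slot values are the "Enclosure Device ID" and "Slot Number" fields from -PDList, and the code prints the commands rather than executing them.

```python
def physdrv(enclosure_id, slot_id):
    """Format the -physdrv[E:S] argument."""
    return f"-physdrv[{enclosure_id}:{slot_id}]"


def removal_commands(enclosure_id, slot_id, adapter=0, megacli="MegaCli64"):
    """The three commands from this note, in order, as argument lists."""
    target = physdrv(enclosure_id, slot_id)
    adp = f"-a{adapter}"
    return [
        [megacli, "-pdoffline", target, adp],      # take the disk offline
        [megacli, "-pdmarkmissing", target, adp],  # mark it as missing
        [megacli, "-pdprprmv", target, adp],       # prepare it for removal
    ]


# Print for review only; nothing is executed here.
for cmd in removal_commands(8, 14):
    print(" ".join(cmd))
```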

Make its LED blink red

MegaCli64 -pdlocate -start -physdrv[8:14] -a0

The LED on the outside of that drive bay will then turn red.

Physically pull the disk out

(This should be fine, hopefully.)

MegaCli Double Check

One disk fewer

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -AdpAllInfo -aALl

                Device Present              
                ================            
Virtual Drives    : 3                       
  Degraded        : 1                                                   
  Offline         : 0                       
Physical Devices  : 24      
  Disks           : 23                       
  Critical Disks  : 0                       
  Failed Disks    : 0                        

VD1 shows Partially Degraded

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -ldinfo -lALL -aALL
Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 43.660 TB
Sector Size         : 512
Parity Size         : 10.915 TB
State               : Partially Degraded
Strip Size          : 64 KB
Number Of Drives    : 10
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
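
When there are several virtual drives, it is handy to pull out just the State of each one. A minimal sketch, assuming the -ldinfo output has a "Virtual Drive: N" line followed later by a "State" line, as in the listings in this note. `virtual_drive_states` is an illustrative name and the sample is shortened to the two relevant lines per drive.

```python
import re

# Shortened sample based on the -ldinfo listings in this note.
SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
State               : Optimal
Virtual Drive: 1 (Target Id: 1)
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
State               : Partially Degraded
"""


def virtual_drive_states(text):
    """Map virtual drive number to the value of its 'State' line."""
    states, current = {}, None
    for line in text.splitlines():
        m = re.match(r"Virtual Drive:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() == "State" and current is not None:
            states[current] = value.strip()
    return states


states = virtual_drive_states(SAMPLE)
not_optimal = [vd for vd, state in states.items() if state != "Optimal"]
print(states, not_optimal)
```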

Check the same slot

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdInfo -PhysDrv[8:14] -aALL 
                                     
Adapter 0: Device at Enclosure - 8, Slot - 14 is not found.

Exit Code: 0x00

Install the New Disk

Disk

Check the specs (capacity, read/write speed, serial number, model number)
Remember to get an invoice with the unified business number
Take a photo

(The new one is a WD60EFZX)

MegaCli: Prepare for rebuilding

Find the disk that was just inserted


(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdInfo -PhysDrv[8:14] -a0
Enclosure Device ID: 8
Slot Number: 14
Enclosure position: N/A
Device Id: 15
WWN: 50014xxxxxxxxxxx
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 5.458 TB [0x2baa0f4b0 Sectors]
Non Coerced Size: 5.457 TB [0x2ba90f4b0 Sectors]
Coerced Size: 5.457 TB [0x2ba900000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0A81
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x50030480xxxxxxx
Connected Port Number: 0(path0)
Inquiry Data:          WD-C81KHxxxxxx WD60EFZX-xxxxxxx                   81.00A81

Disk counts after inserting it

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -AdpAllInfo -aALl

                Device Present              
                ================            
Virtual Drives    : 3                       
  Degraded        : 1                                                   
  Offline         : 0                       
Physical Devices  : 25  (missing + unconfigured)
  Disks           : 24                       
  Critical Disks  : 0                       
  Failed Disks    : 0                        

Find the position it belongs to

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -PdgetMissing -a0
                                     
    Adapter 0 - Missing Physical drives

    No.   Array   Row   Size Expected
    0     1       6     5722624 MB

Exit Code: 0x00
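
A quick sanity check before rebuilding: the "Size Expected" of the missing slot should match the coerced size of the new disk. The arithmetic below is a sketch that assumes 512-byte sectors and 1 MB = 2^20 bytes; under those assumptions the coerced sector count 0x2ba900000 from the -pdInfo output above works out to exactly 5722624 MB. `parse_missing` and `sectors_to_mb` are made-up helper names.

```python
import re

# Sample text copied from the -PdgetMissing output above.
MISSING = """\
    Adapter 0 - Missing Physical drives

    No.   Array   Row   Size Expected
    0     1       6     5722624 MB
"""


def parse_missing(text):
    """Return one dict per row of the missing-drive table."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*MB\s*$", line)
        if m:
            _no, array, row, size_mb = map(int, m.groups())
            rows.append({"array": array, "row": row, "size_mb": size_mb})
    return rows


def sectors_to_mb(sectors, sector_bytes=512):
    """Assumption: 512-byte sectors and 1 MB = 2**20 bytes."""
    return sectors * sector_bytes // 2**20


slot = parse_missing(MISSING)[0]
new_disk_mb = sectors_to_mb(0x2BA900000)  # "Coerced Size" sector count of the new disk
print(slot, new_disk_mb, new_disk_mb >= slot["size_mb"])
```

The array and row values parsed here are the ones that go into -PdReplaceMissing.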

Megacli rebuild

Fill in its position (array and row)

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -PdReplaceMissing  -PhysDrv[8:14] -array1 -row6 -a0
                       
Adapter: 0: Missing PD at Array 1, Row 6 is replaced.
                                        
Exit Code: 0x00 


(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdrbld -start -PhysDrv[8:14]  -a0
                                     
Started rebuild progress on device(Encl-8 Slot-14)

Exit Code: 0x00
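
The two commands above share the same -PhysDrv target and differ only in the array/row arguments, so they can be generated together once the array and row are known from -PdgetMissing. A small sketch with a hypothetical helper name (`rebuild_commands`); it only prints the command lines for review and does not run them.

```python
def rebuild_commands(enclosure_id, slot_id, array, row, adapter=0, megacli="./MegaCli64"):
    """Replace-missing followed by rebuild-start, as argument lists."""
    target = f"-PhysDrv[{enclosure_id}:{slot_id}]"
    adp = f"-a{adapter}"
    return [
        [megacli, "-PdReplaceMissing", target, f"-array{array}", f"-row{row}", adp],
        [megacli, "-pdrbld", "-start", target, adp],
    ]


# Values from this note: enclosure 8, slot 14, array 1, row 6.
for cmd in rebuild_commands(8, 14, array=1, row=6):
    print("sudo", " ".join(cmd))
```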

Megacli rebuild progress

Meanwhile, you should see the red LED blinking on the disk that is being rebuilt.

The commands below only check the status.

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -PdgetMissing -a0
                                     
    Adapter 0 - No Missing Drive is Found.

Exit Code: 0x00


(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdInfo -PhysDrv[8:14] -a0
Firmware state: Rebuild


(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdrbld -ShowProg -PhysDrv[8:14] -a0
                                     
Rebuild Progress on Device at Enclosure 8, Slot 14 Completed 0% in 3 Minutes.

Exit Code: 0x00

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdrbld -ShowProg -PhysDrv[8:14] -a0
[sudo] password for linnil1: 
                                     
Rebuild Progress on Device at Enclosure 8, Slot 14 Completed 17% in 135 Minutes.

Exit Code: 0x00
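
To get a feel for how long the rebuild will take, the progress line can be extrapolated linearly. This is a crude estimate that assumes the rebuild rate stays constant, which it may not under load. `parse_progress` and `remaining_minutes` are illustrative names. For the 17% in 135 minutes reading above it gives roughly 659 more minutes, i.e. about 11 hours.

```python
import re


def parse_progress(line):
    """Return (percent, elapsed_minutes) from a -ShowProg line, or None."""
    m = re.search(r"Completed (\d+)% in (\d+) Minutes", line)
    return (int(m.group(1)), int(m.group(2))) if m else None


def remaining_minutes(percent, elapsed):
    """Linear extrapolation; None when there is no progress to extrapolate from."""
    if percent <= 0:
        return None
    return elapsed * (100 - percent) / percent


line = "Rebuild Progress on Device at Enclosure 8, Slot 14 Completed 17% in 135 Minutes."
percent, elapsed = parse_progress(line)
print(f"about {remaining_minutes(percent, elapsed):.0f} minutes left")
```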

See https://www.advancedclustering.com/act_kb/replacing-a-disk-with-megacli/

Megacli rebuild Done

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -pdrbld -ShowProg -PhysDrv[8:14] -a0
                                     
Device(Encl-8 Slot-14) is not in rebuild process

Exit Code: 0x00

Everything is back to normal

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -AdpAllInfo -aALl

                Device Present              
                ================            
Virtual Drives    : 3                       
  Degraded        : 0                        
  Offline         : 0                       
Physical Devices  : 25                       
  Disks           : 24                      
  Critical Disks  : 0                       
  Failed Disks    : 0                   

The state goes from degraded -> optimal

(base) [linnil1@exon MegaCli]$ sudo ./MegaCli64 -ldinfo -lALL -aALL

Virtual Drive: 1 (Target Id: 1)

State               : Optimal

Installation bug?

Install on Ubuntu

https://www.broadcom.com/support/knowledgebase/1211161500661/installing-megacli-in-debian-or-ubuntu

libncurses.so.5 not found

(env) [linnil1@rna server]$ sudo  /opt/MegaRAID/MegaCli/MegaCli64
/opt/MegaRAID/MegaCli/MegaCli64: error while loading shared libraries: libncurses.so.5: cannot open shared object file: No such file or directory

You can install the package (CentOS 8)

sudo yum install ncurses-compat-libs

Install the old ncurses library from apt (Ubuntu 20.04)
https://askubuntu.com/questions/1252062/how-to-install-libncurses-so-5-in-ubuntu-20-04

sudo add-apt-repository universe
sudo apt install libncurses5

Adding New Disks

After inserting the disk

Enclosure Device ID: 8
Slot Number: 17
Enclosure position: N/A
Device Id: 20
WWN: 5000cca2c1d1020f
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 14.552 TB [0x746c00000 Sectors]
Non Coerced Size: 14.551 TB [0x746b00000 Sectors]
Coerced Size: 14.551 TB [0x746b00000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: W232
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500304801780011d
Connected Port Number: 0(path0)
Inquiry Data: 2PH6DW3J            WDC  WUH721816ALE6L4                    PCGNW232
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature : N/A
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
(base) linnil1@exon:~$ sudo ./MegaCli64 -cfgldadd -r6 [8:8,8:9,8:10,8:11,8:14,8:15,8:16,8:17] -a0

Adapter 0: Created VD 1

Adapter 0: Configured the Adapter!!

Exit Code: 0x00
(base) linnil1@exon:~$ sudo ./MegaCli64 -h
(base) linnil1@exon:~$ sudo ./MegaCli64 -LDinfo -L1 -aAll

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 87.313 TB
Sector Size         : 512
Parity Size         : 29.104 TB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 8
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Ongoing Progresses:
  Background Initialization: Completed 0%, Taken 0 min.
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
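
The Size and Parity Size reported for the new VD can be cross-checked with simple RAID 6 arithmetic: with n identical drives, n - 2 drives' worth of space is usable and 2 drives' worth goes to parity. The sketch below assumes 512-byte sectors and that MegaCli's "TB" means 2^40 bytes; with those assumptions the coerced sector counts from the -pdInfo listings reproduce the numbers shown in this note. `tb` and `raid6_sizes` are made-up helper names.

```python
def tb(sectors, sector_bytes=512):
    """Assumption: 512-byte sectors and 1 TB = 2**40 bytes."""
    return sectors * sector_bytes / 2**40


def raid6_sizes(n_drives, coerced_sectors, sector_bytes=512):
    """Return (usable_tb, parity_tb) for a single-span RAID 6 of identical drives."""
    per_drive = tb(coerced_sectors, sector_bytes)
    return (n_drives - 2) * per_drive, 2 * per_drive


# New VD: 8 drives, coerced size 0x746b00000 sectors each.
usable, parity = raid6_sizes(8, 0x746B00000)
print(f"usable {usable:.3f} TB, parity {parity:.3f} TB")

# Older VD 1: 10 drives, coerced size 0x2ba900000 sectors each.
usable_old, parity_old = raid6_sizes(10, 0x2BA900000)
print(f"usable {usable_old:.3f} TB, parity {parity_old:.3f} TB")
```

The first line comes out at 87.313 TB usable and 29.104 TB parity, and the second at 43.660 TB and 10.915 TB, matching the two -ldinfo listings for VD 1.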