DDN incident / maintenance operations: SAS cable failure putting the VDs of one controller into AWL
===
###### tags: `Lustre`

## First actions to take

### Put the OSTs in Degraded mode

```
[root@p3mgt01]# opcm ddn-exascaler oss-attribute-manage -noderange scr13oss -attribute 'degraded' -set 1
Setting Attribute Value: [Degraded] = [1] on Noderange: [scr13o0[1-8]-adm]? (Yes/No) Yes
[p3mgt01]: scr13o05-adm: error: set_param: param_path 'obdfilter/*/degraded': No such file or directory
[p3mgt01]: scr13o01-adm: error: set_param: param_path 'obdfilter/*/degraded': No such file or directory
scr13o02-adm: obdfilter.scratch-OST0061.degraded=1
scr13o03-adm: obdfilter.scratch-OST0062.degraded=1
scr13o04-adm: obdfilter.scratch-OST0063.degraded=1
scr13o06-adm: obdfilter.scratch-OST0065.degraded=1
scr13o07-adm: obdfilter.scratch-OST0066.degraded=1
scr13o08-adm: obdfilter.scratch-OST0067.degraded=1
```

The `set_param` errors on scr13o01-adm and scr13o05-adm mean that no obdfilter target is mounted on those nodes at that moment (most likely their OSTs are currently served by the failover partner).

### Put the OSS nodes serving the impacted controller's OSTs in standby

```
for node in scr13o01 scr13o03 scr13o05 scr13o07; do crm node standby ${node}; done
```

## Detecting the problem

```
scr13 RAID[0]$ show vd
***************************
*     Virtual Disk(s)     *
***************************
                                                                  |        Home
Idx|Name            | State |Pool|RAID|Capacity  |Settings |Jobs  |Current|Preferred
--------------------------------------------------------------------------------------
0   scratch_ost0096   AWL     0    6    360.9 TiB  TReV WM   VR    0(L) 0  0(L) 0
1   scratch_ost0097   READY   1    6    360.9 TiB  TReV WM   V     1(R) 0  1(R) 0
2   scratch_ost0098   READY   2    6    360.9 TiB  TReV WM   V     0(L) 1  0(L) 1
3   scratch_ost0099   READY   3    6    360.9 TiB  TReV WM   V     1(R) 1  1(R) 1
4   scratch_ost0100   AWL     4    6    360.9 TiB  TReV WM   VR    0(L) 0  0(L) 0
5   scratch_ost0101   READY   5    6    360.9 TiB  TReV WM   V     1(R) 0  1(R) 0
6   scratch_ost0102   READY   6    6    360.9 TiB  TReV WM   V     0(L) 1  0(L) 1
7   scratch_ost0103   READY   7    6    360.9 TiB  TReV WM   V     1(R) 1  1(R) 1
```

Identify the VDs (OSTs) in AWL (Auto Write Locked) state. Here they point to controller 0 (Current column, left-hand value) and socket 0 (Preferred column, right-hand value).

## Tail of the `show sub sum` output:

```
***************************
*     Virtual Disk(s)     *
***************************
                                                                  |        Home
Idx|Name            | State |Pool|RAID|Capacity  |Settings |Jobs  |Current|Preferred
--------------------------------------------------------------------------------------
0   scratch_ost0096   AWL     0    6    360.9 TiB  TReV WM   VR    0(L) 0  0(L) 0

Total Virtual Disks: 1

*******************
*     Pool(s)     *
*******************
                                |Total     |Free     |Free Spare|     |Disk|  Global   |Spare |Member|Minimum
Idx|Name            |State      |Capacity  |Capacity |Capacity  |Jobs |T/O |spare pool |Policy|Count |Rebuilds
--------------------------------------------------------------------------------------------------------------------
0   scratch_ost0096  SUBOPTIMAL  460.2 TiB  0 B       0 B        VR    5    UNASSIGNED  SWAP   49/51  1

Total Storage Pools: 1

***************************
*     Virtual Disk(s)     *
***************************
                                                                  |        Home
Idx|Name            | State |Pool|RAID|Capacity  |Settings |Jobs  |Current|Preferred
--------------------------------------------------------------------------------------
4   scratch_ost0100   AWL     4    6    360.9 TiB  TReV WM   VR    0(L) 0  0(L) 0

Total Virtual Disks: 1

*******************
*     Pool(s)     *
*******************
                                |Total     |Free     |Free Spare|     |Disk|  Global   |Spare |Member|Minimum
Idx|Name            |State      |Capacity  |Capacity |Capacity  |Jobs |T/O |spare pool |Policy|Count |Rebuilds
--------------------------------------------------------------------------------------------------------------------
4   scratch_ost0100  SUBOPTIMAL  460.2 TiB  0 B       0 B        VR    5    UNASSIGNED  SWAP   49/51  1

Total Storage Pools: 1
***************************
*     Unassigned Pool     *
***************************
Total     |Failed    |Total|
Capacity  |Capacity  | PDs |
----------------------------
36.1 TiB   36.1 TiB    4

Total Unassigned Pools: 1

***************************************
*     Unassigned Physical Disk(s)     *
***************************************
    |Enclosure|                                 |S|                                                         |Health       |            |                 |Block|
Idx |Pos |Slot| Vendor | Product ID      |Type  |E|Capacity |RPM |Revision| Serial Number |Pool|State       | Idx|State   | WWN             |Size |
--------------------------------------------------------------------------------------------------------------------------------------------------
4    4    27   HGST     HUH721010AL4200   SAS      9.0 TiB   7.2K DD03     JEHEEJ6X        UNAS FAILED       188  READY    5000cca26750cc5c  4K
4    4    31   HGST     HUH721010AL4200   SAS      9.0 TiB   7.2K DD03     JEH15Z8X        UNAS FAILED       64   READY    5000cca2673a893c  4K
4    4    50   HGST     HUH721010AL4200   SAS      9.0 TiB   7.2K DD03     JEHEBHLX        UNAS FAILED       71   READY    5000cca26750ae08  4K
4    4    61   HGST     HUH721010AL4200   SAS      9.0 TiB   7.2K DD03     JEHEVTKX        UNAS FAILED       193  READY    5000cca26751939c  4K
------------------------------------------------------------------------------------------------------------------------------------------------
       | NUM| Vendor | Product ID      |Type|Capacity |RPM |Revision|Block Size|
--------------------------------------------------------------------------------------------------------------------------------------------------
Found:   4   HGST     HUH721010AL4200   SAS  9.0 TiB   7.2K DD03     4K
```

### Disable WRITE_BACK_CACHING on all the VDs:

```
scrXX RAID[0]$ set vd 0 WRITE_BACK_CACHING false
scrXX RAID[0]$ set vd 1 WRITE_BACK_CACHING false
scrXX RAID[0]$ set vd 2 WRITE_BACK_CACHING false
scrXX RAID[0]$ set vd 3 WRITE_BACK_CACHING false
scrXX RAID[0]$ set vd 5 WRITE_BACK_CACHING false
scrXX RAID[0]$ set vd 6 WRITE_BACK_CACHING false
scrXX RAID[0]$ set vd 7 WRITE_BACK_CACHING false
```

### Show the Dirty Cache on the impacted VDs:

```
scr13 RAID[0]$ show vd 0 all
Dirty Cache: 2.1 GiB
scr13 RAID[0]$ show vd 4 all
Dirty Cache: 0 B
```

### Disable the verify policy:

```
scr13 RAID[0]$ set SUBSYSTEM VERIFY_POLICY false
SUBSYSTEM attributes set
STATUS='Success' (0x0)
```

### Check the running jobs:

```
scr13 RAID[0]$ show job
no job verify
36  FAST REBUILD  POOL:4 (N/A)  QUEUED  0.09  50%  2021-08-15 04:42:20  N/A  0008:58:57
37  FAST REBUILD  POOL:0 (N/A)  QUEUED  0.09  50%  2021-08-15 04:42:37  N/A  0008:58:40
38  FAST REBUILD  POOL:4 (N/A)  QUEUED  0.00  50%  2021-08-15 04:42:48  N/A  0008:58:28
39  FAST REBUILD  POOL:0 (N/A)  QUEUED  0.00  50%  2021-08-15 04:42:58  N/A  0008:58:19
```

### Pause the running jobs:

```
scr13 RAID[0]$ PAUSE JOB 36
scr13 RAID[0]$ PAUSE JOB 37
scr13 RAID[0]$ PAUSE JOB 38
scr13 RAID[0]$ PAUSE JOB 39
```

### Drop the Dirty Cache (the data held in controller memory):

```
scr13 RAID[0]$ CLEAR vd 0 AUTO_WRITE_LOCKED
Are you sure you want to clear Auto Write Locked for VD 0 [Yes]?
scr13 RAID[0]$ CLEAR vd 4 AUTO_WRITE_LOCKED
Are you sure you want to clear Auto Write Locked for VD 4 [Yes]?
```
### Verify that the Dirty Cache has been cleared:

```
scr13 RAID[0]$ show vd 0 all
Dirty Cache: 0 B
scr13 RAID[0]$ show vd 4 all
Dirty Cache: 0 B
```

### Show the VD status again and check that all the VDs are back in READY state:

```
scr13 RAID[0]$ show vd
                                                                  |        Home
Idx|Name            | State |Pool|RAID|Capacity  |Settings |Jobs  |Current|Preferred
--------------------------------------------------------------------------------------
0   scratch_ost0096   READY   0    6    360.9 TiB  TReV WM   R     0(L) 0  0(L) 0
1   scratch_ost0097   READY   1    6    360.9 TiB  TReV M          1(R) 0  1(R) 0
2   scratch_ost0098   READY   2    6    360.9 TiB  TReV M          0(L) 1  0(L) 1
3   scratch_ost0099   READY   3    6    360.9 TiB  TReV M          1(R) 1  1(R) 1
4   scratch_ost0100   READY   4    6    360.9 TiB  TReV WM   R     0(L) 0  0(L) 0
5   scratch_ost0101   READY   5    6    360.9 TiB  TReV M          1(R) 0  1(R) 0
6   scratch_ost0102   READY   6    6    360.9 TiB  TReV M          0(L) 1  0(L) 1
7   scratch_ost0103   READY   7    6    360.9 TiB  TReV M          1(R) 1  1(R) 1
```

### Using the table above, move all the VDs of the impacted controller (here controller 0, see the Current column, left-hand value) to the other controller (here controller 1), keeping the same CPU affinity (see the Preferred column, right-hand value):

```
scr13 RAID[0]$ MOVE_HOME vd 0 PROCESSOR 1 0
scr13 RAID[0]$ MOVE_HOME vd 2 PROCESSOR 1 1
scr13 RAID[0]$ MOVE_HOME vd 4 PROCESSOR 1 0
scr13 RAID[0]$ MOVE_HOME vd 6 PROCESSOR 1 1
```

### Check the VD status again and confirm that all the VDs are now on the other controller:

```
scr13 RAID[0]$ show vd
                                                                  |        Home
Idx|Name            | State |Pool|RAID|Capacity  |Settings |Jobs  |Current|Preferred
--------------------------------------------------------------------------------------
0   scratch_ost0096   READY   0    6    360.9 TiB  TReV WM   R     1(R) 0  1(R) 0
1   scratch_ost0097   READY   1    6    360.9 TiB  TReV M          1(R) 0  1(R) 0
2   scratch_ost0098   READY   2    6    360.9 TiB  TReV M          1(R) 1  1(R) 1
3   scratch_ost0099   READY   3    6    360.9 TiB  TReV M          1(R) 1  1(R) 1
4   scratch_ost0100   READY   4    6    360.9 TiB  TReV WM   R     1(R) 0  1(R) 0
5   scratch_ost0101   READY   5    6    360.9 TiB  TReV M          1(R) 0  1(R) 0
6   scratch_ost0102   READY   6    6    360.9 TiB  TReV M          1(R) 1  1(R) 1
7   scratch_ost0103   READY   7    6    360.9 TiB  TReV M          1(R) 1  1(R) 1
```

### Re-enable the verify policy:

```
scr13 RAID[0]$ set SUBSYSTEM VERIFY_POLICY true
```

### Check the physical disks of the UNASSIGNED_POOL to identify the impacted enclosure:

```
scr13 RAID[0]$ show UNASSIGNED_POOL PHYSICAL_DISKS
4    4    27   HGST     HUH721010AL4200   SAS   9.0 TiB   7.2K DD03   JEHEEJ6X   UNAS FAILED   188  READY   5000cca26750cc5c  4K
4    4    31   HGST     HUH721010AL4200   SAS   9.0 TiB   7.2K DD03   JEH15Z8X   UNAS FAILED   64   READY   5000cca2673a893c  4K
4    4    50   HGST     HUH721010AL4200   SAS   9.0 TiB   7.2K DD03   JEHEBHLX   UNAS FAILED   71   READY   5000cca26750ae08  4K
4    4    61   HGST     HUH721010AL4200   SAS   9.0 TiB   7.2K DD03   JEHEVTKX   UNAS FAILED   193  READY   5000cca26751939c  4K
```

##### Here the impacted enclosure is enclosure 4

### Clear the FAILED status on the affected PDs:

```
scr13 RAID[0]$ clear pd 188 FAILED
scr13 RAID[0]$ clear pd 64 FAILED
scr13 RAID[0]$ clear pd 71 FAILED
scr13 RAID[0]$ clear pd 193 FAILED
```

### Disable WRITE_BACK_CACHING again on the two affected VDs:

```
scr13 RAID[0]$ set vd 0 WRITE_BACK_CACHING false
scr13 RAID[0]$ set vd 4 WRITE_BACK_CACHING false
```

### Verification

```
show vd
                                                                  |        Home
Idx|Name            | State |Pool|RAID|Capacity  |Settings |Jobs  |Current|Preferred
-----------------------------------------------------------------------------------------------
0   scratch_ost0096   READY   0    6    360.9 TiB  TReV M    VRC   1(R) 0  1(R) 0
1   scratch_ost0097   READY   1    6    360.9 TiB  TReV M    V     1(R) 0  1(R) 0
2   scratch_ost0098   READY   2    6    360.9 TiB  TReV M    V     1(R) 1  1(R) 1
3   scratch_ost0099   READY   3    6    360.9 TiB  TReV M    V     1(R) 1  1(R) 1
4   scratch_ost0100   READY   4    6    360.9 TiB  TReV M    VRC   1(R) 0  1(R) 0
5   scratch_ost0101   READY   5    6    360.9 TiB  TReV M    V     1(R) 0  1(R) 0
6   scratch_ost0102   READY   6    6    360.9 TiB  TReV M    V     1(R) 1  1(R) 1
7   scratch_ost0103   READY   7    6    360.9 TiB  TReV M    V     1(R) 1  1(R) 1
```

### Wait for the end of the COPYBACK on the PDs:

```
scr13 RAID[0]$ SHOW JOB
***************************
*          Jobs           *
***************************
                                          |Fraction|        |               |                    |                    |Elapsed
Idx|Type     |Target (Sub)     |State     |Complete|Priority|Status         |Start Time          |End Time            |Time
------------------------------------------------------------------------------------------------------------------------------------------------------------
44  VERIFY    VD:0 (N/A)        RUNNING     4.90     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
45  VERIFY    VD:1 (N/A)        RUNNING     4.99     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
46  VERIFY    VD:2 (N/A)        RUNNING     5.02     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
47  VERIFY    VD:3 (N/A)        RUNNING     5.02     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
48  VERIFY    VD:4 (N/A)        RUNNING     4.90     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
49  VERIFY    VD:5 (N/A)        RUNNING     4.99     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
50  VERIFY    VD:6 (N/A)        RUNNING     5.02     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
51  VERIFY    VD:7 (N/A)        RUNNING     5.02     70%                     2021-08-15 13:49:46  N/A                  0070:04:55
54  COPYBACK  POOL:0 (PD:71 )   COMPLETED   100.00   10%      JS_GBL_SUCCESS  2021-08-15 13:53:24  2021-08-18 09:35:53  0067:42:28
52  COPYBACK  POOL:4 (PD:188)   COMPLETED   100.00   10%      JS_GBL_SUCCESS  2021-08-15 13:52:57  2021-08-18 09:16:04  0067:23:06
55  COPYBACK  POOL:4 (PD:193)   COMPLETED   100.00   10%      JS_GBL_SUCCESS  2021-08-15 13:53:31  2021-08-18 08:41:46  0066:48:14
53  COPYBACK  POOL:0 (PD:64 )   COMPLETED   100.00   10%      JS_GBL_SUCCESS  2021-08-15 13:53:07  2021-08-18 08:37:58  0066:44:51
```

## Fixing the incident and returning the FS to service

### Shut down controller scr13c01

`opcm ddn-sfa controller-shutdown -noderange scr13c01-adm`

### Start controller scr13c01 again

### Move one VD back to its home controller:

`MOVE_HOME vd 0 PROCESSOR 0 0`

### Monitor SCSI errors for 5 minutes (see the monitoring sketch at the end of this note)

### Move the 3 remaining VDs back:

```
MOVE_HOME vd 2 PROCESSOR 0 1
MOVE_HOME vd 4 PROCESSOR 0 0
MOVE_HOME vd 6 PROCESSOR 0 1
```

### Monitor SCSI errors for one hour

### Take the OSS nodes out of standby:

```
crm node online scr13o01
crm node online scr13o03
crm node online scr13o05
crm node online scr13o07
```

### Set mb_c3_threshold back to 90:

`opcm ddn-exascaler oss-attribute-manage -set 90 -attribute mct -noderange scr13oss`

### Schedule the bitmap preload:

#### Edit the crontab on p3mgt01 to schedule a bitmap preload on the OSS nodes of the affected array, for the same day at 20:00 (see the sketch at the end of this note)

### Take the OSTs out of degraded mode:

```
[root@p3mgt01]# opcm ddn-exascaler oss-attribute-manage -noderange scr13oss -attribute 'degraded' -set 0
```
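To double-check the degraded flag directly on the OSS side, the standard Lustre `lctl` syntax can be used as a complement to the `opcm` wrapper above (the node name below is only an example; the expected output format matches the `set_param` output shown in the first step):

```
# Read the degraded flag of every obdfilter target mounted on one OSS
ssh scr13o02-adm "lctl get_param obdfilter.*.degraded"
# expected after the last step: obdfilter.scratch-OST0061.degraded=0
```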
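For the two SCSI-error monitoring steps above, this note gives no command. A minimal sketch using standard Linux tools could look like the following; it assumes `pdsh` is available on p3mgt01 and that the OSS names scr13o0[1-8] resolve, so adjust it to local tooling:

```bash
# Hypothetical monitoring loop (not part of the original procedure):
# poll the kernel log of every OSS for SCSI-related errors every 10 s.
# 30 iterations ~ 5 minutes; use 360 iterations for the 1-hour window.
for i in $(seq 1 30); do
    pdsh -w 'scr13o0[1-8]' "dmesg -T | grep -iE 'scsi error|i/o error|blk_update_request' | tail -n 3"
    sleep 10
done
```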
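For the bitmap preload scheduling step, a sketch of the crontab entry on p3mgt01 is shown below. The preload script path is a placeholder (the actual site tool is not named in this note) and the day/month fields must be set to the date of the intervention:

```bash
# Hypothetical crontab entry on p3mgt01 (edit with `crontab -e`):
# run the site-specific bitmap preload on the scr13 OSS nodes at 20:00
# on the day of the intervention (15 August is only an example).
# /path/to/preload_bitmaps.sh is a placeholder, not a real tool from this note.
0 20 15 8 * /path/to/preload_bitmaps.sh scr13oss >> /var/log/preload_bitmaps.log 2>&1
```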