Snehal Hodage
# Understand Ceph Upgrades (Brownfield) [TM#195](https://github.com/airshipit/treasuremap/issues/195)

[TOC]

## Investigation and plan for an upgrade of Ceph on an existing cluster

This POC determines the proposed plan to upgrade an existing Ceph cluster from v15.2.13 to v16.2.6. It also identifies the required sequence, dependencies, and constraints, as well as any impacts to cluster availability and performance during an upgrade.

Assuming that the original cluster was deployed using the rook-ceph operator, the brownfield scenario consists of two independent steps:

* Operator upgrade
* Ceph upgrade

Both steps can be accomplished in either order by following the set of rules listed below:

1. Before the upgrade, the Ceph cluster should be in a healthy state. It is possible (but not recommended) to perform an upgrade on a cluster that has some warning alarms; in that event, the person responsible for maintenance must make the call. (A pre-upgrade health check is sketched after this list.) Below are some examples of warnings under which we can still proceed with an upgrade:
   * some OSDs are permanently out/down because of drive errors
   * some PGs are in a peering/waiting state because of scrubbing or deep scrubbing
   * some PGs have not been scrubbed in time

   However, warnings like:
   * OSD almost full
   * OSDs are flapping

   and similar should be considered a red flag for the brownfield upgrade. To summarize: a human decision must be made about warning severity.
2. The upgrade should be performed one major release at a time, e.g. Rook 1.6 -> 1.7 and/or Ceph 15.x -> Ceph 16.x. It is recommended to upgrade Ceph to the latest minor release before performing a major release upgrade.
3. When planning a Ceph upgrade to the next major release, it is recommended to perform the operator upgrade first. Usually, the Rook operator supports three major Ceph releases (N-1, N, and N+1); e.g. Rook 1.7 supports Nautilus, Octopus, and Pacific.
4. It is possible to perform a downgrade as well. For Ceph, downgrades have been tested between minor releases. When performing a downgrade, pay close attention to the Ceph release notes: we can downgrade between bug-fix releases, but feature releases must never be downgraded. As an example, the latest Octopus should not be downgraded to a previous minor version because of a database schema change.
5. Different upgrade scenarios performed in the local lab confirm that there are no significant performance or availability impacts. The operator upgrade does not affect Ceph functionality; according to the Rook documentation, the Ceph cluster remains fully functional with only minimal limitations. The performance impact during the Ceph upgrade is comparable to the impact of regular maintenance such as an OSD node reboot or a hard drive replacement. This level of impact is expected and well documented. To summarize, both brownfield operations are harmless for the cluster.
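As a concrete gate for rule 1, the sketch below refuses to proceed unless the cluster reports `HEALTH_OK`, and surfaces the warning details for a human decision otherwise. This is a minimal sketch, assuming the standard rook-ceph toolbox deployment (label `app=rook-ceph-tools`) in the `rook-ceph` namespace used throughout this note:

```bash
#!/usr/bin/env bash
# Pre-upgrade health gate (sketch). Assumes the rook-ceph toolbox pod
# exists in the rook-ceph namespace, as in the cluster shown below.
set -euo pipefail

NS="rook-ceph"
TOOLS=$(kubectl -n "$NS" get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')

HEALTH=$(kubectl -n "$NS" exec "$TOOLS" -- ceph health)

if [ "$HEALTH" = "HEALTH_OK" ]; then
    echo "Cluster healthy; safe to proceed with the upgrade."
elif [ "${HEALTH%% *}" = "HEALTH_WARN" ]; then
    # Warnings need a human decision (see rule 1): print details and stop.
    kubectl -n "$NS" exec "$TOOLS" -- ceph health detail
    echo "HEALTH_WARN: review the warnings above before proceeding." >&2
    exit 1
else
    echo "Cluster is not healthy ($HEALTH); do not upgrade." >&2
    exit 2
fi
```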
## Ceph Upgrade Process

In this scenario, we will upgrade Ceph from v15.2.13 to v16.2.6.

### Prerequisites and health status

Initial status of the Ceph cluster:

```
airship@d105:~$ kubectl get cephclusters.ceph.rook.io -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH      EXTERNAL
rook-ceph   /var/lib/rook     3          6d23h   Ready   Cluster created successfully   HEALTH_OK
```

```
kubectl exec -n rook-ceph rook-ceph-tools-65c94d77bb-6czmn -- ceph status
  cluster:
    id:     0b59ebfb-2e36-45aa-af62-02e1d41cc2e6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 6d)
    mgr: a(active, since 6d)
    osd: 3 osds: 3 up (since 6d), 3 in (since 6d)

  data:
    pools:   2 pools, 33 pgs
    objects: 64 objects, 158 MiB
    usage:   3.5 GiB used, 15 GiB / 18 GiB avail
    pgs:     33 active+clean
```

Verify the Rook operator version:

```
airship@d105:~$ kubectl get deployments.apps rook-ceph-operator -n rook-ceph -o=custom-columns="NAME:.metadata.name,IMAGE:.spec.template.spec.containers[*].image"
NAME                 IMAGE
rook-ceph-operator   rook/ceph:v1.7.11
```

Ceph version upgrades: Rook v1.7 supports the following Ceph versions:

* Ceph Pacific 16.2.0 or newer
* Ceph Octopus v15.2.0 or newer
* Ceph Nautilus 14.2.5 or newer

Existing Ceph version in the cluster:

```
airship@d105:~$ kubectl get deployments.apps -n rook-ceph -o=custom-columns="NAME:.metadata.name,IMAGE:.spec.template.spec.containers[*].image"
NAME                              IMAGE
csi-cephfsplugin-provisioner      k8s.gcr.io/sig-storage/csi-attacher:v3.3.0,k8s.gcr.io/sig-storage/csi-snapshotter:v4.2.0,k8s.gcr.io/sig-storage/csi-resizer:v1.3.0,k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0,quay.io/cephcsi/cephcsi:v3.4.0,quay.io/cephcsi/cephcsi:v3.4.0
csi-rbdplugin-provisioner         k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0,k8s.gcr.io/sig-storage/csi-resizer:v1.3.0,k8s.gcr.io/sig-storage/csi-attacher:v3.3.0,k8s.gcr.io/sig-storage/csi-snapshotter:v4.2.0,quay.io/cephcsi/cephcsi:v3.4.0,quay.io/cephcsi/cephcsi:v3.4.0
rook-ceph-crashcollector-node03   ceph/ceph:v15.2.13
rook-ceph-crashcollector-node04   ceph/ceph:v15.2.13
rook-ceph-crashcollector-node05   ceph/ceph:v15.2.13
rook-ceph-mgr-a                   ceph/ceph:v15.2.13
rook-ceph-mon-a                   ceph/ceph:v15.2.13
rook-ceph-mon-b                   ceph/ceph:v15.2.13
rook-ceph-mon-c                   ceph/ceph:v15.2.13
rook-ceph-operator                rook/ceph:v1.7.11
rook-ceph-osd-0                   ceph/ceph:v15.2.13
rook-ceph-osd-1                   ceph/ceph:v15.2.13
rook-ceph-osd-2                   ceph/ceph:v15.2.13
```

```
airship@d105:~$ kubectl get pods -n rook-ceph
NAME                                               READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-8f2zm                             3/3     Running     0          6d23h
csi-cephfsplugin-d27bl                             3/3     Running     0          6d23h
csi-cephfsplugin-kmz8j                             3/3     Running     0          6d23h
csi-cephfsplugin-provisioner-689686b44-bfpzp       6/6     Running     0          6d23h
csi-cephfsplugin-provisioner-689686b44-d699m       6/6     Running     0          6d23h
csi-rbdplugin-9dsst                                3/3     Running     0          6d23h
csi-rbdplugin-fw2nk                                3/3     Running     0          6d23h
csi-rbdplugin-provisioner-5775fb866b-7fng8         6/6     Running     0          6d23h
csi-rbdplugin-provisioner-5775fb866b-7r4xf         6/6     Running     0          6d23h
csi-rbdplugin-rs2w8                                3/3     Running     0          6d23h
rook-ceph-crashcollector-node03-df5fccdc4-xj44l    1/1     Running     0          6d23h
rook-ceph-crashcollector-node04-7d5b4dd9df-8pzz7   1/1     Running     0          6d23h
rook-ceph-crashcollector-node05-77d88cf7bd-fbdfp   1/1     Running     0          6d23h
rook-ceph-mgr-a-84855f9b9d-wg8vd                   1/1     Running     0          6d23h
rook-ceph-mon-a-5cb4fbdf47-wgh2w                   1/1     Running     0          7d1h
rook-ceph-mon-b-88d5c7db6-7n9kc                    1/1     Running     0          7d1h
rook-ceph-mon-c-cdf7b8bc-zx5wt                     1/1     Running     0          7d1h
rook-ceph-operator-8595fc774f-gr75s                1/1     Running     0          6d23h
rook-ceph-osd-0-c5cccc678-ptvlv                    1/1     Running     0          6d23h
rook-ceph-osd-1-5d7f769f5d-w5bhk                   1/1     Running     0          6d23h
rook-ceph-osd-2-799c7ddb87-wg68x                   1/1     Running     0          6d23h
rook-ceph-osd-prepare-node03-gd2nf                 0/1     Completed   0          127m
rook-ceph-osd-prepare-node04-59m72                 0/1     Completed   0          126m
rook-ceph-osd-prepare-node05-pqzss                 0/1     Completed   0          126m
rook-ceph-tools-65c94d77bb-6czmn                   1/1     Running     0          6d23h
```

Cluster with one master and three worker nodes:

```
airship@d105:~$ kubectl get nodes
NAME     STATUS   ROLES                  AGE    VERSION
node01   Ready    control-plane,master   12d    v1.21.2
node03   Ready    <none>                 7d1h   v1.21.2
node04   Ready    <none>                 7d1h   v1.21.2
node05   Ready    <none>                 12d    v1.21.2
```

Each worker node has two disks configured: one runs the OS (root disk) and one will be used for Ceph storage. The output below shows the available storage, which is identical on each host: /dev/sda is the root disk containing the OS install, and /dev/sdb is an untouched disk that will be used for Ceph.

```
deployer@node05:~$ sudo fdisk -l | grep /dev/sd
Disk /dev/sda: 30 GiB, 32212254720 bytes, 62914560 sectors
/dev/sda1  2629632 62781439 60151808 28.7G Linux filesystem
/dev/sda2     2048    10239     8192    4M BIOS boot
/dev/sda3    10240  1056767  1046528  511M EFI System
/dev/sda4  1056768  2629631  1572864  768M Linux filesystem
/dev/sda5 62781440 62914526   133087   65M Linux filesystem
Disk /dev/sdb: 10 GiB, 10737418240 bytes, 20971520 sectors
/dev/sdb1 2048 12584959 12582912 6G 83 Linux
```

### Steps for Ceph Upgrade

1. Update the main Ceph daemons

   Begin the upgrade by changing the Ceph image field in the cluster CRD (`spec.cephVersion.image`):

   ```
   NEW_CEPH_IMAGE='quay.io/ceph/ceph:v16.2.6-20210918'
   CLUSTER_NAME="$ROOK_CLUSTER_NAMESPACE"  # change if your cluster name is not the Rook namespace
   kubectl -n $ROOK_CLUSTER_NAMESPACE patch CephCluster rook-ceph --type=merge -p "{\"spec\": {\"cephVersion\": {\"image\": \"$NEW_CEPH_IMAGE\"}}}"
   ```

2. Wait for the daemon pod updates to complete

   Status can be determined in a similar way to the Rook upgrade:

   ```
   watch --exec kubectl -n $ROOK_CLUSTER_NAMESPACE get deployments -l rook_cluster=$ROOK_CLUSTER_NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{" \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{" \tceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}'
   ```

3. Wait for the upgrade to complete

   Ceph mons, mgrs, and OSDs are terminated and replaced with updated versions in sequence. The cluster may be offline very briefly as mons update, and the Ceph filesystem may fall offline a few times while the MDSes are upgrading. This is normal.

   The versions of the components can be viewed as they are updated:

   ```
   kubectl get deployments.apps -n rook-ceph -o=custom-columns="NAME:.metadata.name,IMAGE:.spec.template.spec.containers[*].image"
   ```
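   To wait unattended rather than watching, the same `ceph-version` deployment label used in the verification command of step 4 can be polled until it converges. The loop below is a sketch built on that label check, not a documented Rook procedure; it assumes `ROOK_CLUSTER_NAMESPACE` is set as in the steps above and that `TARGET` matches the release string Rook reports:

   ```bash
   #!/usr/bin/env bash
   # Sketch: block until every Ceph daemon deployment carries the target
   # ceph-version label (hypothetical convenience wrapper).
   set -euo pipefail

   TARGET="16.2.6-0"

   while true; do
       versions=$(kubectl -n "$ROOK_CLUSTER_NAMESPACE" get deployment \
           -l rook_cluster="$ROOK_CLUSTER_NAMESPACE" \
           -o jsonpath='{range .items[*]}{.metadata.labels.ceph-version}{"\n"}{end}' | sort -u)
       if [ "$versions" = "$TARGET" ]; then
           echo "All daemons are on ceph-version=$TARGET"
           break
       fi
       echo "Still converging; versions present:" $versions
       sleep 30
   done
   ```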
   After the upgrade, pods are created with the new Ceph image version:

   ```
   rook-ceph-crashcollector-node03-796bc855d5-4lmvd   1/1   Running     0   16m
   rook-ceph-crashcollector-node04-7949c4dddb-djdnf   1/1   Running     0   13m
   rook-ceph-crashcollector-node05-f9c854567-mcz78    1/1   Running     0   15m
   rook-ceph-mgr-a-bcddcb64b-r9pr2                    1/1   Running     0   13m
   rook-ceph-mon-a-6878cc4679-c97cc                   1/1   Running     0   15m
   rook-ceph-mon-b-76584cf74b-6txkl                   1/1   Running     0   13m
   rook-ceph-mon-c-58d994876c-98hvh                   1/1   Running     0   16m
   rook-ceph-operator-8595fc774f-gr75s                1/1   Running     0   7d
   rook-ceph-osd-0-64cb7bb64f-lvrhc                   1/1   Running     0   12m
   rook-ceph-osd-1-5587cf66f9-th6b2                   1/1   Running     0   12m
   rook-ceph-osd-2-8fbb84756-z2mqc                    1/1   Running     0   12m
   rook-ceph-osd-prepare-node03-cffpg                 0/1   Completed   0   13m
   rook-ceph-osd-prepare-node04-wr465                 0/1   Completed   0   12m
   rook-ceph-osd-prepare-node05-qzz79                 0/1   Completed   0   12m
   ```

   ```
   rook-ceph-crashcollector-node03   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-crashcollector-node04   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-crashcollector-node05   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-mgr-a                   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-mon-a                   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-mon-b                   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-mon-c                   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-operator                rook/ceph:v1.7.11
   rook-ceph-osd-0                   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-osd-1                   quay.io/ceph/ceph:v16.2.6-20210918
   rook-ceph-osd-2                   quay.io/ceph/ceph:v16.2.6-20210918
   ```

4. Verify the updated cluster

   ```
   airship@d105:~$ kubectl -n $ROOK_CLUSTER_NAMESPACE get deployment -l rook_cluster=$ROOK_CLUSTER_NAMESPACE -o jsonpath='{range .items[*]}{"ceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}' | sort | uniq
   ceph-version=16.2.6-0
   ```

## Observations

1. During the Ceph upgrade, Ceph components such as crash collectors, mons, OSDs, and the mgr are upgraded to the new version (a cross-check sketch follows this list).
2. OSDs went down one after another, a single node at a time out of the three nodes, so there were always two OSDs up to preserve quorum.
3. The cluster health went to HEALTH_WARN while OSDs were down. After the upgrade, the OSDs came back up and the cluster health returned to HEALTH_OK.
4. A node reboot is not required after the Ceph upgrade.
5. The Ceph upgrade does not have any impact on the Rook operator.
6. Different upgrade scenarios performed in the local lab confirm that there are no significant performance or availability impacts. The Ceph upgrade does not affect Rook functionality; according to the Rook documentation, the Ceph cluster remains fully functional with only minimal limitations. The performance impact during the Ceph upgrade is comparable to the impact of regular maintenance such as an OSD node reboot or a hard drive replacement. This level of impact is expected and well documented.
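As an extra cross-check on observation 1, the running daemons can be asked directly which release they are on: `ceph versions` reports the version per daemon type, so once the upgrade has converged every mon, mgr, and OSD should appear under a single 16.2.6 entry. Run from the same toolbox pod used throughout this note:

```bash
# Cross-check from inside the toolbox: all daemon types should report
# the same Pacific build once the upgrade has converged.
kubectl exec -n rook-ceph rook-ceph-tools-65c94d77bb-6czmn -- ceph versions
```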
#### Performance impact during the upgrade process

Initial Ceph status:

```
kubectl exec -n rook-ceph rook-ceph-tools-65c94d77bb-6czmn -- ceph status
  cluster:
    id:     0b59ebfb-2e36-45aa-af62-02e1d41cc2e6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 6d)
    mgr: a(active, since 6d)
    osd: 3 osds: 3 up (since 6d), 3 in (since 6d)

  data:
    pools:   2 pools, 33 pgs
    objects: 64 objects, 158 MiB
    usage:   3.5 GiB used, 15 GiB / 18 GiB avail
    pgs:     33 active+clean
```

One OSD went down, and the Ceph mons also go down one at a time during the upgrade; the cluster shows a warning status:

```
  health: HEALTH_WARN
          1 osds down
          1 host (1 osds) down
          Degraded data redundancy: 65/195 objects degraded (33.333%), 28 pgs degraded

  data:
    pools:   2 pools, 33 pgs
    objects: 65 objects, 158 MiB
    usage:   516 MiB used, 17 GiB / 18 GiB avail
    pgs:     24.242% pgs not active
             49/195 objects degraded (25.128%)
             22 active+undersized+degraded
             8  peering
             3  active+undersized
```

After the OSDs are upgraded to the latest version, there is no impact on the data: the writes performed against the cluster are still intact.

```
airship@d105:~$ kubectl exec -n rook-ceph rook-ceph-tools-65c94d77bb-6czmn -- ceph status
  cluster:
    id:     0b59ebfb-2e36-45aa-af62-02e1d41cc2e6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 16h)
    mgr: a(active, since 18h)
    osd: 3 osds: 3 up (since 18h), 3 in (since 7d)

  data:
    pools:   2 pools, 96 pgs
    objects: 104 objects, 248 MiB
    usage:   1.0 GiB used, 17 GiB / 18 GiB avail
    pgs:     96 active+clean

  io:
    client:   4.7 KiB/s wr, 0 op/s rd, 0 op/s wr
```

Write operations intact after the Ceph upgrade:

```
airship@d105:~$ kubectl exec -n rook-ceph rook-ceph-tools-65c94d77bb-6czmn -- ceph osd status
ID  HOST    USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  node03  428M  5716M      0        0       0        0   exists,up
 1  node04  447M  5696M      0     6552       0        0   exists,up
 2  node05  446M  5697M      0        0       0        0   exists,up
```
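To quantify how long the cluster actually sits in HEALTH_WARN during a run, a small polling loop can log health transitions with timestamps while the daemons restart. The sketch below is a lab convenience, not part of the documented upgrade procedure; it reuses the toolbox pod name from this note, which should be adjusted for other clusters:

```bash
#!/usr/bin/env bash
# Sketch: log cluster health every 10 s during the upgrade so the
# HEALTH_WARN window can be measured afterwards from the timestamps.
NS="rook-ceph"
TOOLS="rook-ceph-tools-65c94d77bb-6czmn"

while true; do
    status=$(kubectl -n "$NS" exec "$TOOLS" -- ceph health 2>/dev/null || echo "UNREACHABLE")
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $status"
    sleep 10
done | tee ceph-health-during-upgrade.log
```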
