OpenEBS Jiva Discussions Notes

  1. Walk through the status update
  2. Discussion topic

Add your request, question or suggestion to our issue list

If you have anything you’d like to put on the agenda, please do so below for the next meeting:

Mar 10th

Topics for discussion:

  • Jiva Volume status being updated very frequently even in RW mode. Relevant reconcile code:

        switch instance.Status.Phase {
        case jv.JivaVolumePhaseReady, jv.JivaVolumePhaseSyncing:
            return reconcile.Result{}, r.getAndUpdateVolumeStatus(instance)
        }
  • Merge jiva csi and operator yaml
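
One way to keep status writes down, sketched here under assumptions: write the status only when it actually changed, and requeue at a fixed interval (the Feb 24 notes mention a 5s resync interval). `VolumeStatus`, `Result` and `reconcileStatus` are hypothetical stand-ins for illustration, not the jiva-operator types or its actual fix.

```go
package main

import (
	"fmt"
	"time"
)

// VolumeStatus is a hypothetical, trimmed-down stand-in for the status
// stored on the JivaVolume CR.
type VolumeStatus struct {
	Phase        string
	ReplicaCount int
}

// Result is a hypothetical stand-in for reconcile.Result.
type Result struct {
	RequeueAfter time.Duration
}

// resyncInterval mirrors the 5s interval mentioned in the Feb 24 notes.
const resyncInterval = 5 * time.Second

// reconcileStatus writes the status only when it differs from the stored
// copy, and always asks to be called again after resyncInterval.
func reconcileStatus(stored *VolumeStatus, observed VolumeStatus, write func(VolumeStatus) error) (Result, error) {
	if *stored != observed {
		if err := write(observed); err != nil {
			return Result{}, err
		}
		*stored = observed
	}
	return Result{RequeueAfter: resyncInterval}, nil
}

func main() {
	writes := 0
	write := func(VolumeStatus) error { writes++; return nil }
	stored := VolumeStatus{Phase: "Syncing", ReplicaCount: 3}

	// Same status observed again: nothing is written.
	res, _ := reconcileStatus(&stored, VolumeStatus{Phase: "Syncing", ReplicaCount: 3}, write)
	// Phase changed: exactly one write.
	reconcileStatus(&stored, VolumeStatus{Phase: "Ready", ReplicaCount: 3}, write)

	fmt.Println("writes:", writes, "requeue after:", res.RequeueAfter)
}
```

Comparing before writing avoids an API update on every reconcile pass, while the fixed requeue interval bounds how stale the status can get.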

NFS

Pending items for winding up jiva-csi

  • upgrade github actions test(Merged)
  • Migration doc from non CSI to CSI volumes(PR to be raised)
  • Jiva Operator Documentation(PR raised under review)
  • Jiva README(PR raised under review)
  • e2e PR merge/pipeline handover(waiting on pipeline fix, PR raised and reviewed)

Mar 3

  • [Shubham]

    • Add version label on all the resources(Merged)
    • operator-sdk pr under review by akhil and prateek(Merged)
    • Upgrade pending(Waiting on maya update)
    • Upgrade test cases (")
    • Non CSI to CSI migration doc (Probably next release)
    • Non CSI to CSI migration test (")
    • Remove '(' from version in jivavolume CR
    • Error logs
  • [Prateek]

    • Add jiva operator to openebs charts repo
    • Google analytics (Conflicts to be resolved)
  • [Payes]

    • Verify support for RO volumes in cstor/jiva
    • Setup e2e pipeline for jiva-operator(A new branch has been created)
    • README(PR under review)
    • Docs
    • Add events to Jiva Volume CR
    • Verify if Local PV are resized when jiva Volume PVC is resized
  • [Kiran]

Feb 24

  • [Shubham]

    • Add version label on all the resources(Merged)
    • operator-sdk pr under review by akhil and prateek(Merged)
    • Upgrade pending(Waiting on maya update)
    • Upgrade test cases (")
    • Non CSI to CSI migration doc (Probably next release)
    • Non CSI to CSI migration test (")
    • Jiva Volume status being updated very frequently even in RW mode. (Fixed by increasing the resync interval to 5s)
  • [Prateek]

    • Add jiva operator to openebs charts repo
    • Google analytics (Under review)
  • [Payes]

    • Verify support for RO volumes in cstor/jiva
    • Setup e2e pipeline for jiva-operator(A new branch has been created)
    • README(PR under review)
    • Docs
    • Add events to Jiva Volume CR
    • Verify if Local PV are resized when jiva Volume PVC is resized
  • [Kiran]

Feb 17

  • [Shubham]

    • Jiva Volume status being updated very frequently even in RW mode.
    • Upgrade operator-sdk in jiva-operator(To be decided)
    • Upgrade (Draft PR raised) will be taken in next release
    • Upgrade test (Next release)
    • Non CSI to CSI migration doc (Probably next release)
    • Non CSI to CSI migration test (")
  • [Prateek]

    • Analytics for jiva-csi volumes
  • [Payes]

    • Verify support for RO volumes in cstor/jiva
    • Setup e2e pipeline for jiva-operator
    • Docs(PR sent to update README)
    • Add events to Jiva Volume CR
    • Verify if Local PV are resized when jiva Volume PVC is resized
  • [Kiran]

Feb 10

  • [Shubham]

    • Reviewing the helm chart PRs
    • jiva-operator doesn't reconcile if jiva-volume policy is applied later(Merged)
    • Upgrade (Draft PR raised) will be taken in next release
    • Upgrade test (Next release)
    • Non CSI to CSI migration doc (Probably next release)
    • Non CSI to CSI migration test (")
  • [Prateek]

    • Multi arch (Merged)
    • Helm chart (Merged)
    • Topology support(Merged)
  • [Payes]

    • Jiva = hash based quorum. Check for the case where 3 replicas are registered and a non-master replica tries to reconnect with a new ID.(Verified)

    • jiva: restart while initial registration (Merged)

    • Update go version in jiva to 1.14.7(Merged)

    • raw block support(merged)

    • Resize failing for raw block volumes(k8s does not support in 1.19, need to document this after cross verifying)

    • gotgt: metrics - issue/enhancement

    • gotgt: refactor io processing to single thread with queue vs thread per io.

    • csi: ginkgo tests

    • e2e: csi based tests

    • e2e: upgrade / migration tests

    • e2e: missing resiliency tests?
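
The gotgt item above contrasts a thread per IO with a single thread draining a queue. A minimal sketch of the two shapes, assuming a toy in-memory device: `ioRequest`, `processPerIO` and `processQueued` are hypothetical names for illustration and are not gotgt code.

```go
package main

import (
	"fmt"
	"sync"
)

// ioRequest is a hypothetical stand-in for one write command.
type ioRequest struct {
	lba  int
	data byte
}

// processPerIO starts one goroutine per request. Writes to the shared
// device are guarded by a mutex and may complete in any order.
func processPerIO(reqs []ioRequest, dev []byte) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for _, r := range reqs {
		wg.Add(1)
		go func(r ioRequest) {
			defer wg.Done()
			mu.Lock()
			dev[r.lba] = r.data
			mu.Unlock()
		}(r)
	}
	wg.Wait()
}

// processQueued pushes requests onto a channel drained by a single worker,
// so requests are applied in submission order and the device needs no lock.
func processQueued(reqs []ioRequest, dev []byte) {
	queue := make(chan ioRequest, len(reqs))
	done := make(chan struct{})
	go func() {
		defer close(done)
		for r := range queue {
			dev[r.lba] = r.data
		}
	}()
	for _, r := range reqs {
		queue <- r
	}
	close(queue)
	<-done
}

func main() {
	// Two writes hit LBA 0; with the queue the later one always wins.
	reqs := []ioRequest{{0, 'a'}, {1, 'b'}, {0, 'c'}}
	dev := make([]byte, 2)
	processQueued(reqs, dev)
	fmt.Printf("%s\n", dev)
}
```

The queued form trades parallelism for ordering and simpler locking; whether that is a net win for gotgt would need measurement.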

Feb 3

  • [Shubham]

    • Merge operator and csi (Merged)
    • Version details on jiva-volume CR (Merged)
    • Target Affinity (Merged)
    • Upgrade (Draft PR raised) will be taken in next release
    • Upgrade test (Next release)
    • Non CSI to CSI migration doc (Probably next release)
    • Non CSI to CSI migration test (")
  • [Prateek]

    • Multi arch (Merged)
    • Helm chart (Inprogress - RC2)
  • [Payes]

    • jiva: restart while initial registration (Up for review)

    • csi: Root mount (Merged)

    • csi: Attach required as false (Merged)

    • raw block support(PR raised)

    • csi matrix for raw block(PR raised)

    • Update go version in jiva to 1.14.7(PR raised)

    • Use logrus for jiva operator (PR needs to be raised)

    • Topology support (To be added)

    • Readme Pending(RC2)

    • jiva-operator doesn't reconcile if jiva-volume policy is applied later

    • Resize failing for raw block volumes

    • gotgt: metrics - issue/enhancement

    • gotgt: refactor io processing to single thread with queue vs thread per io.

    • csi: ginkgo tests

    • e2e: csi based tests

    • e2e: upgrade / migration tests

    • e2e: missing resiliency tests?

Jan 27

  • [Shubham]

    • Merge operator and csi (Merged)
    • Version details on jiva-volume CR (In progress)
    • Upgrade (Draft PR raised)
    • Upgrade test
    • Non CSI to CSI migration doc
    • Non CSI to CSI migration test
  • [Prateek]

    • Multi arch (Inprogress)
    • Helm chart (Inprogress)
  • [Payes]

    • jiva: restart while initial registration (Merged)

    • csi: Root mount (Merged)

    • csi: Attach required as false (Merged)

    • Use logrus for jiva operator

    • Topology support (To be added)

    • Check raw block support(PR to be sent)

    • Check csi matrix(Code complete, PR will be raised)

    • gotgt: metrics - issue/enhancement

    • gotgt: refactor io processing to single thread with queue vs thread per io.

    • csi: ginkgo tests

    • e2e: csi based tests

    • e2e: upgrade / migration tests

    • e2e: missing resiliency tests?

Jan 20

  • [Shubham]

    • Merge operator and csi (Review pending)
    • Version details on jiva-volume CR
    • Upgrade
    • Node Affinity pending in Jiva Volume Policy(Not Required, dependent on local PV)
  • [Prateek]

    • Multi arch
    • Helm chart
  • [Payes]

    • jiva: restart while initial registration (PR Raised)

    • csi: Root mount (PR Raised)

    • csi: Attach required as false (PR Raised)

    • Use logrus for jiva operator

    • Review and merge https://github.com/openebs/jiva-operator/pull/28/files

    • gotgt: metrics - issue/enhancement

    • gotgt: refactor io processing to single thread with queue vs thread per io.

    • csi: ginkgo tests

    • e2e: csi based tests

    • e2e: upgrade / migration tests

    • e2e: missing resiliency tests?

Jan 13

  • [Shubham]

    • Created separate issues for tasks.
    • Working on adding version details to jivavolume CR.
    • CSI volume upgrades == will be added to openebs/upgrade
    • multi-arch builds with github actions
    • helm chart for jiva
    • Do we need an admission webhook to validate jiva volume day 2 operations?
      • Resize will be done via PVC
      • Replica Scale-up? will not support this.
  • [Payes]

    • jiva: restart while initial registration
    • gotgt: metrics - issue/enhancement
    • gotgt: refactor io processing to single thread with queue vs thread per io.
    • csi: Root mount
    • csi: Attach required as false
    • csi: ginkgo tests
    • e2e: csi based tests
    • e2e: upgrade / migration tests
    • e2e: missing resiliency tests?
  • Open Discussions

    • Merge jiva-csi and jiva operator code into jiva-operators
      • jiva-operators will have the CSI as well as helm charts
      • jiva-operators follow Operator SDK pattern and JivaVolumeCR is embedded into this repo.
      • Pros:
        • Avoids maintaining dependencies and E2e tests in two repos
      • Cons:
    • log framework
      • cStor CSI uses logrus
      • K8s uses klog everywhere
      • uber/zap is used in jiva operator
    • troubleshooting
      • depends on the above log framework
      • easy to integrate with tools used in the customer clusters
      • prometheus/grafana/alertmanager integrations

Jan 6

  • [Shubham]
    • Work on completion of parity
  • [Payes]
    • Start working on uuid change
  • [Kiran]
    • Got permission for kubernetes-sigs repo

Dec 23

  • [Shubham/prateek]
    • Continue with feature parity
  • [Payes]
    • Clean empty files in raw block mode(Kubernetes bug not ours)
    • Add attacher container in multi attach PR(Done)

Dec 16

Dec 2nd

  • [Payes]

    • Node failure test cases in progress

      • Learning Ansible (done)
      • Learning the current jiva pipelines and seeing the execution status
      • Raising a PR for simple change in the current jiva pipeline
      • Raise a PR to crash application node (Under review)
      • Raise a PR to crash RW replica node (Under review)
      • Raise a PR to crash WO node (Under review)
      • Work on raising a PR to litmus for node shutdown tests
      • m_upgrade or jiva-csi
    • NFS Provisioner

      • Raised a PR to fix 152.152 (done)
      • Fix the CLA(done)
      • Test for upgrades with nfs volumes that crossed 152 with previous code and users might have manually edited the exports file(done)
      • Follow up with community to merge
      • To be documented:
        • Quota doesn't work if the underlying filesystem of nfs provisioner is not xfs
      • Below observations need to be shared with mcKinsey(Sent):
        • If only Filesystem_id is updated volume gets deleted but a stale entry in vfs.conf is left
        • If export ID is updated, then the PV will not be deleted, as PV has annotation with old Export ID
        • To delete PV when export ID is updated, PV's annotation should also be updated.
  • [Kiran]

Nov 4th/18th, 2020

Status Updates

  • [Payes]

    • Node failure test cases in progress

      • Learning Ansible (done)
      • Learning the current jiva pipelines and seeing the execution status
      • Raising a PR for simple change in the current jiva pipeline
      • Raise a PR to crash application node (Under review)
      • Raise a PR to crash RW replica node (Under review)
      • Raise a PR to crash WO node (Under review)
      • Work on raising a PR to litmus for node shutdown tests
      • m_upgrade or jiva-csi
    • NFS Provisioner

      • Raised a PR to fix 152.152 (done)
      • Fix the CLA(done)
      • Test for upgrades with nfs volumes that crossed 152 with previous code and users might have manually edited the exports file(done)
      • Follow up with community to merge
      • To be documented:
        • Quota doesn't work if the underlying filesystem of nfs provisioner is not xfs
      • Below observations need to be shared with mcKinsey:
        • If only Filesystem_id is updated volume gets deleted but a stale entry in vfs.conf is left
        • If export ID is updated, then the PV will not be deleted, as PV has annotation with old Export ID
        • To delete PV when export ID is updated, PV's annotation should also be updated.
  • [Kiran]

Oct 21, 2020

Status Updates

  • [Payes]
    • Node failure test cases in progress

Oct 14, 2020

Status Updates

  • [Utkarsh]

    • PRs related to vendor updates in Jiva CSI and Jiva operator
    • CSI Sanity
  • [Payes]

    • Remount, resize and volume provisioning test cases merged
    • Need to connect with gostor team (Pending)
    • Raise gotgt PR upstream(Not yet raised)
    • Node Failure Tests(Pending)
  • [Kiran]

    • NFS operator build code to be pushed
    • OpenEBS NFS Operator (Pending)

Oct 7, 2020

Status Updates

  • [Payes]

    • Remount issue(Merged)
    • Jiva CSI integration tests(In Progress) - PR pushed, travis failing
    • Need to connect with gostor team (Pending)
    • Raise gotgt PR upstream(Not yet raised)
    • Node Failure Tests(Pending)
    • NFS Operator (Pending)
  • [Utkarsh]

    • PR merged in CSI sanity tests
    • Will create PR to remove custom code for CSI sanity
  • [Kiran]

Sept 30, 2020

Status Updates

  • [Shashank]
  • [Payes]
    • License update in Jiva-csi and Jiva-operator (Done)
    • Build failures due to version mismatches (Done)- Received help from Prateek
    • Remount issue PR raised (Under Review) - Sai is Reviewing
    • Jiva CSI integration tests under progress
    • Need to connect with gostor team (Pending)
    • Raise gotgt PR upstream(Not yet raised)
    • Node Failure Tests(Pending)
    • NFS Operator (Pending)

Sept 23, 2020

Attendees: Kiran/Payes/Shashank

Status Updates

  • [Payes]
    • License update(Done)
    • Jiva CSI Stability(InProgress)
    • Node Failure Tests(planned)
    • Raise gotgt PR upstream(InProgress)
    • NFS Operator(Not Planned)
  • [Shashank]
    • Review Node Failure tests(Planned)
    • List NFS tests on Jiva in e2e pipeline(Done)
    • Verify if Jiva Metrics are tested (Planned)
    • Jiva CSI driver related e2e tests (Not Planned)- Not present in pipeline
  • [Kiran]
    • NFS external provisioner CI automation(Planned)

Backlogs

  • Multi arch build for Jiva and Jiva CSI driver (Issue Created)
  • Recovering from complete node failure(code to be added in operator)
  • Jiva CSI and operator integration tests (Pending)
  • Update Contributor and User Documentation
  • Jiva CSI Resize feature - Supported
  • Jiva CSI Metrics - Supported
  • Enhancements to Jivactl
  • Enhancements to OpenEBS ctl to include Jiva Volumes
  • Verify Jiva Volume CR status updates and events

Sept 16, 2020

Attendees: Kiran/Payes

Status Updates

  • [Payes]
    • Update license in all files
    • Jiva CSI Stability
    • Node Failure Tests

Sept 2, 2020

Attendees: Payes

Status Updates

  • [Payes]
    • Jiva Read performance improvement done, code yet to be pushed to upstream gostor repo

Aug 26th, 2020

Attendees: Shashank, Payes, Kiran

Status Updates

  • [Shashank/Payes]
    • Jiva Reads give n*x performance numbers when no writes are being done(Not really happening)
    • Verify if metrics is tested in pipelines
    • Jiva CSI is under progress
      • Verify if Jiva over NFS tests are being run in pipeline
    • What happens if filesystem goes in RO state in pipeline
    • Test e2e replica rebuild slowness using replicated
    • Replica data loss case for dummy path in e2e

Aug 19th, 2020

Attendees: Shashank, Payes, Kiran

Status Updates

  • [Shashank/Payes]
    • 2.0 released with fixes for auto delete of internal snapshots
    • The fix is back-ported/cherry picked to 1.12.x
    • 2.0 also includes refactoring on the rebuild with check-points
    • Enhanced Travis Integration tests around check-point and data integrity tests with chaos = random and continuous were executed manually
    • E2e pipelines were updated with chaos tests on controller failure.

Discussion Topics

  • 2.x Goals - Brainstorming

    • Stability of Jiva - e2e and integration tests
    • NFS on Jiva => stability and best practices
    • Complete cluster shutdown and restart
    • Error reporting for failures and troubleshooting guides
    • Support bundle - take application name as input and generate a zip of config and logs
    • Jiva user guides
    • CSI Driver with remount feature
    • CSI Driver with Replica with Local PV hostpath
    • Workload cluster = jiva
    • Support from Kubera for Jiva
    • Performance comparison b/w single and parallel IO
  • [Refactor E2e snapshot delete test]

    • Current test - trigger the delete and verify data integrity while deletion is in progress

    • controller to replica connection happens - restart master replica. secondary replicas go to crash-loop-back-off. takes time for replica to reconcile.

      • master comes back online after t1
      • replica1 comes back online after t1+2s == starts the sync process
      • master restarted again at t1+1s while sync is in progress
    • Test Details:

      • kill master and wait for new master to take over and bring replicas to RW state
    • R1 (RW), R2 (RW), R3 (WO)

      • 3 replicas are connected
      • Deployed application
    • R1 (RW), R2 (RW), R3 (WO) => Restart R1

      • 1 mins (wait for 1 min)
      • 5 mins (identify new master)
      • x mins (wait for RW, RW, WO)
    • R1 (NA), R2 (RW), R3 (NA) =>

    • R1 (NA), R2 (RW), R3 (WO) =>

    • R1 (WO), R2 (RW), R3 (RW) => Restart R15

    • R1 (NA), R2 (WO), R3 (RW) =>

    • R1 (WO), R2 (RW), R3 (RW) =>

    • 40 snapshots or 20 snapshots

    • R1 (WO), R2 (NA), R3 (RW) => ( 15 min ==> n snapshots * ~copy time + delta R1 CLBTimer)

    • R1 (RW), R2 (WO), R3 (RW) => ( 15 min ==> n snapshots * ~copy time + delta R2 CLBTimer)

    • R1 (RW), R2 (RW), R3 (RW) =>

      • Verify that all replicas are RW ( 15 min ==> n snapshots * 2 * ~copy time )
      • Wait till snaps become 12 ( 15 min ==> n or restarts * 1 mins? )

May 13th, 2020

Attendees: Utkarsh, Payes, Vitta, Giri

Status Updates

  • [Payes]

    • Review pending PR's
    • Help somesh on preload tests
    • Work on preload test scenario
  • [Utkarsh]

    • Re-enable auto-snap deletion (design is done)
    • Re-architect jiva rebuilding and registration (design in progress)
  • [Somesh]

    • e2e for pre-load optimizations (done)
    • jiva replica scaleup refactoring (done)

Discussion Topics

  • [Utkarsh] Failure during a resync could cause data loss
    • Rebuilding and registration process needs architecture change

May 6th, 2020

Attendees: Utkarsh, Payes, Vitta, Kiran

Status Updates

  • [Payes]

    • Travis CI failing (blocker for RC1)
  • [Utkarsh]

  • [Somesh]

    • e2e for pre-load optimizations (in-progress)

Discussion Topics

  • [Utkarsh] Failure during a resync could cause data loss
    • During resync, holes in image files are getting synced
    • If there is a failure after a partial sync of sparse files, the slave can become master.
    • This issue is triggered mainly due to either automatic or manual snap deletion

April 29, 2020

Attendees: Somesh, Utkarsh, Payes, Vitta, Kiran

Status Updates

  • [Somesh]
    • Automated cleanup job testing (review is pending)
    • Manually verified flushing logs to file feature and automated it (review is pending)
  • [Utkarsh]
    • Proposed the improvement/fix for two child issue
    • Sync progress feature has been merged
    • Sync progress on clone volume (review is in progress)
    • Configuring sync server with custom http timeouts or retry (WIP)
  • [Payes]
    • Preload fix has been merged (degraded replica was getting restarted since preload was taking time while write IOs continued)
    • Work with somesh to validate the fix

Discussion Topics

  • [Vitta]

    • Review Backlog at https://github.com/orgs/openebs/projects/1
      • Prioritize auto-snap deletion. Start with the design
      • jiva replica under IO pressure will be taken up by payes. work with utkarsh.
      • jivactl two child is not a concern, so move to near term
  • [Utkarsh]

    • Should a release tag be created on openebs/sparse-tools?
      • create a tag 1.0 and use that in jiva.
      • update dependencies in a different commit helps to know what changes went in.
  • [Giri/Somesh]

    • Functional Testing on Jiva
      • Provisioning of Jiva. Verify setting up and cleaning of various resources (done)
      • Provisioning Jiva with XFS (instead of default Ext4)
      • Volume Metrics from Jiva Target and Replicas
    • Application Testing on Jiva
      • Running Percona/MySQL on Raw Block Device with Jiva
      • Running MongoDB on Jiva.
      • Backup and Restore of PostgreSQL on Jiva
      • Application pods with Cross AZ functionality
    • Chaos Testing on Jiva
      • Provisioners, Jiva Target and Replica (podkill) during provisioning and deletion of PVCs.
      • Application pod kill
      • Node Failure/Restart
      • Node Delete (or changing the hostname of the node) and handle the graceful application pod rescheduling/creating new pod.
    • Scale Testing on Jiva
      • Test for capacity threshold crossing
      • Test for underlying device running out of capacity
      • Test for 50, 100, 200 volumes
    • Soak Testing on Jiva
      • Setup long running clusters - soak test bed
      • Postgresql on Jiva
      • MongoDB on Jiva
      • Use data generators to keep pumping data
      • Introduce occasional chaos as part of build/release pipelines into the soak test bed cluster on these applications.