# TiDB Operator Developing Guide for beginners

This is a guide for beginners who want to develop TiDB Operator. It will help you find your way around the code base and make the changes you want.

## The structure of the code repository of TiDB Operator

You need to get the hang of this code at the beginning; it is a good starting point for your development.

```mermaid
graph LR
/-->/cmd(/cmd <br />the entry functions, <br />you will run 'go build' with these files)
/-->/pkg(/pkg <br />the packages implementing TiDB Operator)
```

There is some code you don't need to care about when you start developing TiDB Operator. The functions of the other directories are listed below.

```mermaid
graph LR
/-->images(/images<br /> image build files, temporarily stores compiled binaries)
/-->charts(/charts <br /> Helm charts with some TiDB Operator deployment configuration <br /> some files are intermediate products of the transition from docker-compose to Kubernetes, <br />and these will be discarded in the future)
/-->examples(/examples <br />samples of TiDB Operator usage)
/-->manifest(/manifest <br />cheatsheet)
/-->marketplace(/marketplace <br />some files related to GCP Marketplace)
/-->misc(/misc <br />small tools for debugging)
/-->ci(/ci <br />CI scripts)
/-->tests(/tests <br />test cases)
/-->static(/static <br />pictures and some resource files)
/-->tools(/tools <br />tools)
```

## Compile, run, and analyse the implementation of TiDB Operator

### Compile the entry files

When you start developing TiDB Operator, you may find it difficult to locate the beginning of the code repository, or you may be confused about the structure of the code. Let's analyze the Makefile first.

```Makefile
build: controller-manager scheduler discovery admission-webhook apiserver backup-manager

controller-manager:
	$(GO_BUILD) -ldflags '$(LDFLAGS)' -o images/tidb-operator/bin/tidb-controller-manager cmd/controller-manager/main.go

scheduler:
	$(GO_BUILD) -ldflags '$(LDFLAGS)' -o images/tidb-operator/bin/tidb-scheduler cmd/scheduler/main.go

discovery:
	$(GO_BUILD) -ldflags '$(LDFLAGS)' -o images/tidb-operator/bin/tidb-discovery cmd/discovery/main.go

admission-webhook:
	$(GO_BUILD) -ldflags '$(LDFLAGS)' -o images/tidb-operator/bin/tidb-admission-webhook cmd/admission-webhook/main.go

apiserver:
	$(GO_BUILD) -ldflags '$(LDFLAGS)' -o images/tidb-operator/bin/tidb-apiserver cmd/apiserver/main.go

backup-manager:
	$(GO_BUILD) -ldflags '$(LDFLAGS)' -o images/tidb-backup-manager/bin/tidb-backup-manager cmd/backup-manager/main.go
```

In this part, we can see some components we already know:

- `controller-manager` is the core of TiDB Operator; it implements the lifecycle management of TiDB clusters.
- `scheduler` extends kube-scheduler. It adds TiDB-specific logic to Kubernetes, such as instance placement rules, for example that each node hosts at most one TiKV pod.
- `discovery` handles service discovery, service registration, and configuration issuance.
- `admission-webhook` overrides some default Kubernetes behavior; it is optional.
- `apiserver` serves some APIs of TiDB Operator.
- `backup-manager` implements backup-related functions.

In one word, when you begin to develop TiDB Operator, you can just go through `controller-manager` and get the hang of the whole lifecycle of one basic TiDB component. We will investigate what the TiKV controller does to help you get used to the development.

> Note:
>
> When you are debugging TiDB Operator, you can compile only the components you are developing by editing the Makefile, for example by keeping only controller-manager in the `build` target.
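Before looking at the runtime behaviour, it helps to have a rough mental model of what an entry file such as `cmd/controller-manager/main.go` does. The sketch below is heavily simplified and the helper names (`newControllers`, the `controller` interface) are hypothetical, not the actual code; the real entry file additionally sets up informer factories, leader election, and feature gates.

```go
package main

import (
	"context"
	"flag"
	"fmt"
)

// controller is a minimal stand-in for the real controller types
// (tidbcluster.Controller, backup.Controller, ...).
type controller interface {
	Run(workers int, stopCh <-chan struct{})
}

// newControllers is a hypothetical helper; the real main.go constructs each
// controller explicitly with NewController(...) as shown later in this guide.
func newControllers() []controller { return nil }

func main() {
	// Parse flags such as --workers, which you can see echoed in the pod logs.
	workers := flag.Int("workers", 5, "number of worker goroutines per controller")
	flag.Parse()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Start every controller; each one watches its own CRD and reconciles it.
	for _, c := range newControllers() {
		go c.Run(*workers, ctx.Done())
	}

	fmt.Println("controllers started")
	// The real main.go blocks inside leader election; here we just block forever.
	select {}
}
```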
### What happens when TiDB Operator is launched

When TiDB Operator is deployed and launched by Kubernetes, it starts many controllers to monitor the CRDs defined by TiDB Operator. These controllers watch the events related to the CRDs they are responsible for, such as create, modify, and delete. You can see this in the logs of the `tidb-controller-manager` pod.

```log
I0915 07:27:18.535396 1 main.go:100] FLAG: --workers="5"
I0915 07:27:18.559733 1 leaderelection.go:241] attempting to acquire leader lease tidb-admin/tidb-controller-manager...
I0915 07:27:35.536096 1 leaderelection.go:251] successfully acquired lease tidb-admin/tidb-controller-manager
I0915 07:27:35.538147 1 upgrader.go:106] Upgrader: APIGroup apps.pingcap.com is not registered, skip checking Advanced Statefulset
I0915 07:27:37.441210 1 main.go:215] cache of informer factories sync successfully
I0915 07:27:37.441252 1 tidb_cluster_controller.go:276] Starting tidbcluster controller
I0915 07:27:37.441285 1 backup_controller.go:119] Starting backup controller
I0915 07:27:37.441301 1 restore_controller.go:118] Starting restore controller
I0915 07:27:37.441317 1 backup_schedule_controller.go:114] Starting backup schedule controller
I0915 07:27:37.441331 1 tidb_initializer_controller.go:105] Starting tidbinitializer controller
I0915 07:27:37.441349 1 tidb_monitor_controller.go:92] Starting tidbmonitor controller
```

```mermaid
graph LR
tidb-controller-manager(tidb-controller-manager started)-->pkg/controller/tidbcluster/tidb_cluster_controller.go
tidb-controller-manager(tidb-controller-manager started)-->backup_controller
tidb-controller-manager(tidb-controller-manager started)-->restore_controller
tidb-controller-manager(tidb-controller-manager started)-->backup_schedule_controller
tidb-controller-manager(tidb-controller-manager started)-->tidb_initializer_controller
tidb-controller-manager(tidb-controller-manager started)-->tidb_monitor_controller
```

How are these controllers started? In the `main` function, controller-manager invokes the `NewController` function defined in every controller.

```go
	tcController := tidbcluster.NewController(kubeCli, cli, genericCli, informerFactory, kubeInformerFactory, autoFailover, pdFailoverPeriod, tikvFailoverPeriod, tidbFailoverPeriod, tiflashFailoverPeriod)
	dcController := dmcluster.NewController(kubeCli, cli, genericCli, informerFactory, kubeInformerFactory, autoFailover, masterFailoverPeriod, workerFailoverPeriod)
	backupController := backup.NewController(kubeCli, cli, informerFactory, kubeInformerFactory)
	restoreController := restore.NewController(kubeCli, cli, informerFactory, kubeInformerFactory)
	bsController := backupschedule.NewController(kubeCli, cli, informerFactory, kubeInformerFactory)
	tidbInitController := tidbinitializer.NewController(kubeCli, cli, genericCli, informerFactory, kubeInformerFactory)
	tidbMonitorController := tidbmonitor.NewController(kubeCli, genericCli, cli, informerFactory, kubeInformerFactory)
	tidbGroupController := tidbgroup.NewController(kubeCli, cli, informerFactory, kubeInformerFactory)
	tikvGroupController := tikvgroup.NewController(kubeCli, cli, genericCli, informerFactory, kubeInformerFactory)
```

### How the tidbcluster controller manages the lifecycle of a TiDB cluster

The `tidbcluster` controller is the first controller we need to investigate: it does the work of TiDB cluster lifecycle management. The other controllers are related to backup and data migration; we will discuss those components later.
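Each of these controllers follows the same client-go pattern: the constructor wires up informers and a rate-limited work queue, and `Run` starts several worker goroutines that drain the queue. The sketch below is a generic, self-contained version of that pattern, not the actual TiDB Operator code; the names are illustrative.

```go
package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// Controller is a simplified, generic version of the structure that every
// TiDB Operator controller (tidbcluster, backup, restore, ...) follows.
type Controller struct {
	queue workqueue.RateLimitingInterface
	// syncHandler reconciles one object identified by its "namespace/name" key.
	syncHandler func(key string) error
}

func NewController() *Controller {
	c := &Controller{
		queue: workqueue.NewNamedRateLimitingQueue(
			workqueue.DefaultControllerRateLimiter(), "example"),
	}
	// In the real controllers, this is where informer event handlers are
	// registered, enqueueing the changed object (see the AddEventHandler
	// snippet later in this guide).
	c.syncHandler = func(key string) error {
		fmt.Println("reconciling", key)
		return nil
	}
	return c
}

// Run starts `workers` goroutines that drain the queue until stopCh closes.
func (c *Controller) Run(workers int, stopCh <-chan struct{}) {
	defer c.queue.ShutDown()
	for i := 0; i < workers; i++ {
		go wait.Until(c.worker, time.Second, stopCh)
	}
	<-stopCh
}

func (c *Controller) worker() {
	for c.processNextWorkItem() {
	}
}

func (c *Controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)
	if err := c.syncHandler(key.(string)); err != nil {
		c.queue.AddRateLimited(key) // retry with backoff on error
		return true
	}
	c.queue.Forget(key)
	return true
}
```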
How is the event management of TiKV registered in `tidb-controller-manager`? A lot of control logic is wired up in `control: NewDefaultTidbClusterControl`. The `tidbcluster` object is managed by `tcControl`, which is passed in at the beginning of `NewDefaultTidbClusterControl`:

```go
		control: NewDefaultTidbClusterControl(
			tcControl,
			...
		)
```

The subcomponent logic is instantiated by constructors such as `mm.NewTiKVMemberManager` in `pkg/controller/tidbcluster/tidb_cluster_controller.go`:

```go
			mm.NewTiKVMemberManager(
				pdControl,
				setControl,
				svcControl,
				typedControl,
				setInformer.Lister(),
				svcInformer.Lister(),
				podInformer.Lister(),
				nodeInformer.Lister(),
				autoFailover,
				tikvFailover,
				tikvScaler,
				tikvUpgrader,
				recorder,
			),
```

The function `NewTiKVMemberManager` is implemented in `pkg/manager/member/tikv_member_manager.go`; we will introduce the details of the TiKV part in the following sections. Its arguments include some Kubernetes resource controllers and the TiKV lifecycle controllers. Some resource-change listeners are also registered when `TiKVMemberManager` is created, such as the StatefulSet informer. It is defined in this line:

```go
	setInformer := kubeInformerFactory.Apps().V1().StatefulSets()
```

This line invokes a function of the Kubernetes SDK, and the informer's lister is passed to the member manager by:

```go
				setInformer.Lister(),
```

> Note:
>
> You can find these entry points by searching for the calls into the Kubernetes SDK.

By doing this, once you apply a YAML file to Kubernetes, TiDB Operator responds to the changes of the `spec`. For example, if you modify the version of the TiDB cluster and apply the YAML file, you will find that the TiDB cluster managed by TiDB Operator starts upgrading: the related components are automatically replaced by the new version.

The controllers are started by the `Run` function, which is defined in every controller.

```go
	onStarted := func(ctx context.Context) {
		...
		...
		...
		go wait.Forever(func() {
			dcController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			backupController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			restoreController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			bsController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			tidbInitController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			tidbMonitorController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			tidbGroupController.Run(workers, ctx.Done())
		}, waitDuration)
		go wait.Forever(func() {
			tikvGroupController.Run(workers, ctx.Done())
		}, waitDuration)
		if controller.PodWebhookEnabled {
			go wait.Forever(func() {
				periodicityController.Run(ctx.Done())
			}, waitDuration)
		}
		if features.DefaultFeatureGate.Enabled(features.AutoScaling) {
			go wait.Forever(func() {
				autoScalerController.Run(workers, ctx.Done())
			}, waitDuration)
		}
		wait.Forever(func() {
			tcController.Run(workers, ctx.Done())
		}, waitDuration)
	}
```

Now the lifecycle of the TiDB ecosystem is managed by TiDB Operator. When you create a CRD object such as a `tidbcluster`, apply changes to the TiDB cluster, or upgrade the TiDB cluster, `controller-manager` and the other controllers in TiDB Operator will respond to your request and fulfill your intent.

## When applying changes to the tidbcluster

When you create or modify a `tidbcluster` CRD YAML file and apply it with `kubectl`, TiDB Operator validates the YAML file and performs a sequence of updates. In this process, the functions of the `...MemberManager` subcomponents in `pkg/manager/member` are executed.
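Conceptually, each reconcile round runs the member managers in a fixed order and returns early on error, so the next round can retry from a consistent state. The sketch below only illustrates that shape; the interface and helper names are simplified stand-ins for the real code in `pkg/controller/tidbcluster/tidb_cluster_control.go` and `pkg/manager/member`. The actual order is shown in the diagram that follows.

```go
package example

import "fmt"

// TidbCluster is a stand-in for the real v1alpha1.TidbCluster CRD object.
type TidbCluster struct {
	Name string
	// Spec and Status omitted for brevity.
}

// Manager mirrors the idea of the reclaimPolicyManager, pdMemberManager,
// tikvMemberManager, ... components in pkg/manager/member.
type Manager interface {
	Sync(tc *TidbCluster) error
}

// updateTidbCluster shows the shape of one reconcile round: run every member
// manager in order; if any of them fails or has to wait (for example, PD is
// not ready yet), return the error and let the next round retry.
func updateTidbCluster(tc *TidbCluster, managers []Manager) error {
	for _, m := range managers {
		if err := m.Sync(tc); err != nil {
			return fmt.Errorf("reconcile %s: %w", tc.Name, err)
		}
	}
	return nil
}
```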
```mermaid
graph TD
tidbcluster(tidbcluster yaml create or update)
tcInformer(tcInformer.Listen)
tidbcluster-->|watched by|tcInformer
tcInformer-->|invoke<br>pkg/controller/tidbcluster/tidb_cluster_control.go|UpdateTidbCluster
defaulting
validate
updateTidbCluster
subgraph prechecks
UpdateTidbCluster-->defaulting
defaulting-->validate
validate-->updateTidbCluster
end
subgraph apply/reconcile
reclaimPolicyManager
orphanPodsCleaner
discoveryManager
podRestarter
pdMemberManager
tikvMemberManager
pumpMemberManager
tidbMemberManager
tiflashMemberManager
ticdcMemberManager
metaManager
pvcResizer
tidbClusterStatusManager
updateTidbCluster-->|invoke <br> functions in pkg/manager/member|reclaimPolicyManager
reclaimPolicyManager-->orphanPodsCleaner
orphanPodsCleaner-->discoveryManager
discoveryManager-->podRestarter
podRestarter-->pdMemberManager
pdMemberManager-->tikvMemberManager
tikvMemberManager-->pumpMemberManager
pumpMemberManager-->tidbMemberManager
tidbMemberManager-->tiflashMemberManager
tiflashMemberManager-->ticdcMemberManager
ticdcMemberManager-->metaManager
metaManager-->pvcResizer
pvcResizer-->tidbClusterStatusManager
tidbClusterStatusManager-->|reconcile|reclaimPolicyManager
tidbClusterStatusManager-->|until status==spec|continue
end
```

The `tidbcluster` controller runs the reconcile process round by round until `status` matches `spec`. For details about how the subcomponents work, refer to the section `How to implement the lifecycle management of TiKV`.

In `pkg/controller/tidbcluster/tidb_cluster_controller.go`, there is a work queue that keeps the controller from blocking. When changes are applied, they are enqueued and wait for the `workers` to process them.

```go
	tcInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: tcc.enqueueTidbCluster,
		UpdateFunc: func(old, cur interface{}) {
			tcc.enqueueTidbCluster(cur)
		},
		DeleteFunc: tcc.enqueueTidbCluster,
	})
```

```go
	for i := 0; i < workers; i++ {
		go wait.Until(tcc.worker, time.Second, stopCh)
	}

func (tcc *Controller) worker() {
	for tcc.processNextWorkItem() {
	}
}
```

By doing this, we can process the changes smoothly even when many changes are applied at once.

## How to implement the lifecycle management of TiKV

In the content above, we introduced how `tidb-controller-manager` is built and how the controllers are run: which component launches the TiKV controller, how the controllers are registered in the system, and how change events trigger the related controller. A controller takes responsibility for the lifecycle management of its component. When it finds that the user's demand has changed, for example the user wants to upgrade the cluster, it sets off the operations needed to fulfill that demand.

Take `tikvMemberManager` as an example: it is a subcomponent implemented in `pkg/manager/member`, and it implements the logic of how to `sync` the changes. The main functions for TiKV are in `pkg/manager/member/tikv_member_manager.go`. There is a sequence of checks that decides whether to set off the related lifecycle events. When processing requests such as `scale`, `upgrade`, and `failover`, `tikv_member_manager.go` calls the related functions in the corresponding files.

```mermaid
graph TD
tikvMemberManager-->specified(Is TiKV included in the spec?)
specified-->|no|return(return)
specified-->|yes|PDavailabel(Is PD available?)
PDavailabel-->|no,err|return
PDavailabel-->|yes|syncServiceForTidbCluster(syncServiceForTidbCluster)
syncServiceForTidbCluster-->service-existed(Does the service exist?)
service-existed-->|no|CreateService(svcControl.CreateService)
service-existed-->|yes|syncStatefulSetForTidbCluster(syncStatefulSetForTidbCluster)
CreateService-->syncStatefulSetForTidbCluster
syncStatefulSetForTidbCluster-->Paused(Paused?)
Paused-->|yes,err|return(return)
Paused-->|no, and the StatefulSet does not exist|CreateStatefulSet(CreateStatefulSet)
CreateStatefulSet-->setStoreLabelsForTiKV(setStoreLabelsForTiKV)
setStoreLabelsForTiKV-->tikvScaler.Scale(tikvScaler.Scale?)
tikvScaler.Scale-->autoFailover(autoFailover?)
autoFailover-->tikvUpgrader.Upgrade(tikvUpgrader.Upgrade?)
tikvUpgrader.Upgrade-->updateStatefulSet(updateStatefulSet?)
```

Now we will introduce the details of the TiKV controller. By doing so, you can learn how a controller in TiDB Operator is designed, what responsibilities a controller should cover, and how to enrich the functions of Kubernetes.

### Concept 1: CRD, reconcile, declarative API

Let's introduce these concepts first.

CRD (Custom Resource Definition) is user-defined. You define the `spec` part to describe your demand, and the `status` part to store the current state. A CRD object is stored as a record in `etcd`. The controller watches `kube-apiserver` to maintain the CRD object as the user wants; `watch` is a capability of `kube-apiserver`, backed by `etcd`. The controller, or `operator` as we defined before, provides the way to adjust the environment from `status` to `spec`. The actual actions of the operator are called `reconcile`.

```mermaid
graph TD
etcd(etcd)-->|store records|kube-apiserver(kube-apiserver)
kube-apiserver---|watch records|TiKV-controller(TiKV Controller)
TiKV-controller-->|spec or status changes|adjust(adjust cluster status to spec)
```

> Note:
>
> There is a trick in TiDB Operator: `status` can serve as a small state database for a stateful application.

### TiKV deployment

The main process of deployment includes the following steps.

```mermaid
graph TD
tikv-lifecycle(TiKV lifecycle management)
create-tikv-cluster(Create<br/> TiKV cluster)
pd-cluster-available(Wait for PD cluster available)
create-tikv-service(Create TiKV service)
create-tikv-statefulset(Create TiKV statefulset)
sync-tikv-status-from-pd-to-tidbcluster(Sync the status of the TiKV cluster to tidbcluster)
set-scheduler-labels-to-TiKV-stores(Set scheduler labels to TiKV stores)
upgrade-tikv-cluster(Upgrade <br/> TiKV cluster)
scale-tikv-cluster(Scale out/in <br/>the TiKV cluster)
failover(Failover <br/> the TiKV cluster)
tikv-lifecycle-->create-tikv-cluster
tikv-lifecycle-->scale-tikv-cluster
tikv-lifecycle-->upgrade-tikv-cluster
tikv-lifecycle-->failover
create-tikv-cluster-->pd-cluster-available
pd-cluster-available-->create-tikv-service
create-tikv-service-->create-tikv-statefulset
create-tikv-statefulset-->sync-tikv-status-from-pd-to-tidbcluster
sync-tikv-status-from-pd-to-tidbcluster-->set-scheduler-labels-to-TiKV-stores
```

We can see that some operations are common to `scale`, `upgrade`, and `failover`. This is because these lifecycle event managers only handle additional work such as recording and planning. The main processing of Kubernetes resources such as `pod` and `pvc` is done by the reconcile process of `tidbcluster`. The other event handlers just change `status` according to the real condition to set off the `reconcile` functions.
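To make the `spec` versus `status` idea concrete, the sketch below shows how a sync step can derive its next action purely from the difference between the desired `spec` and the observed `status`. The types and fields are simplified placeholders, not the real `v1alpha1` API.

```go
package example

// TiKVSpec and TiKVStatus are simplified placeholders for the real CRD fields.
type TiKVSpec struct {
	Replicas int32
	Version  string
}

type TiKVStatus struct {
	Replicas int32
	Version  string
}

// nextAction illustrates the declarative idea: the controller never receives
// an "upgrade" or "scale" command; it only observes that status differs from
// spec and derives the action needed to converge them.
func nextAction(spec TiKVSpec, status TiKVStatus) string {
	switch {
	case status.Version != spec.Version:
		return "upgrade" // rolling update, one pod at a time
	case status.Replicas < spec.Replicas:
		return "scale out"
	case status.Replicas > spec.Replicas:
		return "scale in"
	default:
		return "no-op" // status already matches spec
	}
}
```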
### TiKV scale

When `replicas` in `spec` changes, the TiKV controller judges whether the request is to scale out or scale in the TiKV cluster.

For scaling out, we need to delete the PVCs retained by an earlier scale-in: because the TiKV cluster is healthy, the retained PVCs are no longer needed.

For scaling in, we need to delete the store records in PD and check whether the store has been deleted successfully (becomes Tombstone). Moreover, we should check whether the pod is stuck in Pending; in that case, if the store is not registered in PD, we can delete the pod directly without asking PD.

```mermaid
graph TD
tikvMemberManager(mm.NewTiKVMemberManager)
tikvMemberManager-->tikvFailover*(tikvFailover*)
tikvMemberManager-->tikvScaler(tikvScaler<br> pkg/manager/member/tikv_scaler.go)
tikvMemberManager-->tikvUpgrader*(tikvUpgrader*)
tikvMemberManager-->skipped(...)
tikvScaler-->scale_init(Scale)
scale_init-->scale_general(pkg/manager/member/scaler.go<br> extends the general scaler)
scale_general-->Scale(Scale)
Scale-->|newSet < oldSet|ScaleIn(ScaleIn)
Scale-->|newSet > oldSet|ScaleOut(ScaleOut)
ScaleIn-->controller.GetPDClient.DeleteStore
controller.GetPDClient.DeleteStore-->TombstoneStore?
TombstoneStore?-->UpdatePVC(setDeferDeletingPVC)
UpdatePVC-->setReplicasAndDeleteSlots
setReplicasAndDeleteSlots
controller.GetPDClient.DeleteStore-->!podutil.IsPodReady
!podutil.IsPodReady-->isRegisteredInPD?(Is the pod registered in PD?)
isRegisteredInPD?-->wait(wait for 5 sec)
wait-->setReplicasAndDeleteSlots
ScaleOut-->deleteDeferDeletingPVC(deleteDeferDeletingPVC<br> delete unused PVCs)
deleteDeferDeletingPVC-->setReplicasAndDeleteSlots
```

### TiKV upgrade

For a TiKV upgrade, we need to ask PD to evict the region leaders on the store first (`evictleader`). When the leaders have been transferred away, we can set the upgrade partition of the StatefulSet to update the pod.

```mermaid
graph TD
tikvMemberManager(mm.NewTiKVMemberManager)
tikvMemberManager-->tikvFailover*(tikvFailover*)
tikvMemberManager-->tikvScaler*(tikvScaler*)
tikvMemberManager-->tikvUpgrader(tikvUpgrader<br> pkg/manager/member/tikv_upgrader.go)
tikvMemberManager-->skipped(...)
tikvUpgrader-->pdAvailable?
pdAvailable?-->TiKVScaling?
TiKVScaling?-->status.Synced?
status.Synced?-->podisReady?(Is the pod ready?)
podisReady?-->upgradeTiKVPod
upgradeTiKVPod-->beginEvictLeader
beginEvictLeader-->controller.GetPDClient.BeginEvictLeader
controller.GetPDClient.BeginEvictLeader-->readyToUpgrade?(readyToUpgrade? <br>store.LeaderCount or EvictLeaderTimeout)
readyToUpgrade?-->endEvictLeader
endEvictLeader-->setUpgradePartition
```

### TiKV failover

For failover in TiKV, the functions in `tikv_failover.go` only update the `status` according to the situation. The actual handling of the failed resources is implemented by the TiKV controller in later reconcile rounds.

```mermaid
graph TD
tikvMemberManager(tikvMemberManager <br> pkg/manager/member/tikv_member_manager.go)-->|first round: plan for recovery|tikvFailover(tikvFailover<br> pkg/manager/member/tikv_failover.go)
tikvFailover-->setrecords(set records<br> tikvFailover.Failover only changes status, waiting for the next sync)
setrecords-->tf.isPodDesired(tf.isPodDesired?
<br> to avoid recovering pods deleted by Advanced StatefulSet)
tf.isPodDesired-->setDeadline(set deadline<br> deadline := store.LastTransitionTime + tf.tikvFailoverPeriod)
setDeadline-->check-status-deadline(store.State == v1alpha1.TiKVStateDown && time.Now().After deadline <br> check whether more TiKV instances are down than expected)
check-status-deadline-->set-failure-records-in-status(set failure records in status)
set-failure-records-in-status-->tikvMemberManager
tikvMemberManager-->|second round: do the recovery|sync(sync<br> recreate failed resources)
sync-->tikvFailover.RemoveUndesiredFailures(RemoveUndesiredFailures <br> remove undesired records to avoid unnecessary recovery)
tikvFailover.RemoveUndesiredFailures-->tikvFailover.Recover(tikvFailover.Recover<br> remove records whose resources have been recovered)
```

### Discovery

In the startup script of each component, there is a step that queries information from `discovery`. For example, TiDB fetches some of its startup flags from `discovery` with `curl`. When different components query `discovery` at different times, the `discovery` service responds with the corresponding answers.

### Concept 2: tidb-scheduler and scheduler extender, scheduling rules for TiDB

Here, scheduling is mainly about which node a pod is placed on and which storage node a PV is placed on. For example, TiDB Operator places TiKV instances separately on three nodes instead of putting them on the same node, or lets a pod take up a whole node via `quota`; TiDB also needs to spread hot regions. These are demands unique to TiDB, which Kubernetes does not implement by default, so we need to extend the scheduler to fulfill them. The job of [TiDB Scheduler](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-scheduler) is to respond to the queries from kube-scheduler.

There are some basic scheduling rules for a TiDB cluster.

Scheduling rule 1 (PD): Make sure that the number of PD instances scheduled on each node is less than `Replicas / 2`.

Scheduling rule 2 (TiKV): If the number of Kubernetes nodes is less than three (in this case, TiKV cannot achieve high availability), scheduling is not limited; otherwise, the number of TiKV instances that can be scheduled on each node is no more than `ceil(Replicas / 3)`.

Scheduling rule 3 (TiDB): When you perform a rolling update to a TiDB instance, the instance tends to be scheduled back to its original node.
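TiDB Scheduler enforces rules like these as a kube-scheduler extender: kube-scheduler calls it over HTTP during the filter phase, and it returns the nodes that are still acceptable. The following is a minimal, generic sketch of such an extender, with simplified request/response types and a placeholder rule; the real predicates live in `pkg/scheduler/predicates`, and the real server uses the official extender API types and more routes.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Simplified stand-ins for the kube-scheduler extender payloads.
type filterArgs struct {
	Pod       map[string]interface{} `json:"pod"`
	NodeNames []string               `json:"nodenames"`
}

type filterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

// filter keeps only the nodes that satisfy our custom rules, e.g. "no more
// than ceil(Replicas/3) TiKV pods of one cluster per node".
func filter(args filterArgs) filterResult {
	allowed := make([]string, 0, len(args.NodeNames))
	failed := map[string]string{}
	for _, node := range args.NodeNames {
		if nodeAllowed(node) {
			allowed = append(allowed, node)
		} else {
			failed[node] = "violates TiDB HA placement rule"
		}
	}
	return filterResult{NodeNames: allowed, FailedNodes: failed}
}

// nodeAllowed is a hypothetical predicate; it stands in for the logic in
// ha.go, stable_scheduling.go, and predicate.go.
func nodeAllowed(node string) bool { return true }

func main() {
	// The chart's scheduler policy points kube-scheduler at a URL prefix like
	// http://127.0.0.1:10262/scheduler, so the extender only serves HTTP.
	http.HandleFunc("/scheduler/filter", func(w http.ResponseWriter, r *http.Request) {
		var args filterArgs
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(filter(args))
	})
	log.Fatal(http.ListenAndServe(":10262", nil))
}
```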
The code structure of tidb-scheduler is listed below:

```mermaid
graph TD
charts/tidb-operator/templates/config/_scheduler-policy-json.tpl(/charts/tidb-operator/templates/config/_scheduler-policy-json.tpl <br/> urlPrefix: http://127.0.0.1:10262/scheduler <br/> references the scheduler extender in the scheduler Policy)
cmd/scheduler/main.go(cmd/scheduler/main.go <br> entry function)
charts/tidb-operator/templates/config/_scheduler-policy-json.tpl-->cmd/scheduler/main.go
pkg/scheduler/server/mux.go(pkg/scheduler/server/mux.go <br> URL router)
cmd/scheduler/main.go-->pkg/scheduler/server/mux.go
pkg/scheduler/scheduler.go(pkg/scheduler/scheduler.go <br> main logic, imports functions from the predicates)
pkg/scheduler/server/mux.go-->pkg/scheduler/scheduler.go
pkg/scheduler/predicates/ha.go(pkg/scheduler/predicates/ha.go)
pkg/scheduler/predicates/stable_scheduling.go(pkg/scheduler/predicates/stable_scheduling.go)
pkg/scheduler/predicates/predicate.go(pkg/scheduler/predicates/predicate.go)
pkg/scheduler/scheduler.go-->pkg/scheduler/predicates/ha.go
pkg/scheduler/scheduler.go-->pkg/scheduler/predicates/stable_scheduling.go
pkg/scheduler/scheduler.go-->pkg/scheduler/predicates/predicate.go
```

## Learn from the TiKV controller

To understand the logic of the TiKV controller, we need to focus on how the lifecycle event managers are registered with Kubernetes. They are included as submodules of the controllers. The controllers run listeners that watch `kube-apiserver` through the Kubernetes SDK. When changes are applied, the listeners notify the related components to set off the functions that handle them. In this process, `status` acts as a state database for the stateful logic. Kubernetes maintains the operators; the operators watch the real situation and reflect it in `status`; the handler functions then adjust the cluster from `status` to `spec`, which is what the user demands.
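As a final illustration of the "status as a small state database" idea, the sketch below shows how a failover routine can record a failed store in `status` during one reconcile round and let a later round act on that record. The types and fields are simplified placeholders for the real `v1alpha1` types, and the helper names are illustrative.

```go
package example

import "time"

// FailureStore is a simplified placeholder for the failure records that the
// real operator keeps in the TiKV part of the TidbCluster status.
type FailureStore struct {
	PodName   string
	StoreID   string
	CreatedAt time.Time
}

// Status is a stand-in for the TiKV part of the TidbCluster status.
type Status struct {
	FailureStores map[string]FailureStore
}

// markFailure is what the failover step does: it only records the failure in
// status; it does not touch any Kubernetes resource itself.
func markFailure(st *Status, podName, storeID string, downSince time.Time, failoverPeriod time.Duration) {
	if time.Since(downSince) < failoverPeriod {
		return // not down long enough yet; wait for a later reconcile round
	}
	if st.FailureStores == nil {
		st.FailureStores = map[string]FailureStore{}
	}
	if _, exists := st.FailureStores[podName]; !exists {
		st.FailureStores[podName] = FailureStore{PodName: podName, StoreID: storeID, CreatedAt: time.Now()}
	}
}

// desiredReplicas is what a later reconcile round does with the records:
// it schedules extra replicas to replace the failed stores.
func desiredReplicas(specReplicas int32, st Status) int32 {
	return specReplicas + int32(len(st.FailureStores))
}
```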