How to reset a persistent volume's status from terminating back to bound.

Aug 25, 2020 • 11 minutes to read

TL;DR

When you delete a Kubernetes persistent volume by accident, it may get stuck in the Terminating status, because the kubernetes.io/pv-protection finalizer prevents it from being deleted. You can use the k8s-reset-terminating-pv tool to reset its status back to Bound.

Prologue

You may notice this is the first post of my blog.🥳 Actually, I had wanted to start a blog a very, very long time ago, but never found anything worth sharing until the day I deleted a very important persistent volume by accident.😱 The PV was used by the Docker image repository that holds hundreds of images used by several Kubernetes clusters, and its reclaim policy was Delete. I can still vividly recall the moment I realized what I had just done. After a while, though, the PV was still in the Terminating status. My first thought was that the PV was so big that it would take time to delete; then I realized it was the finalizer protecting the PV from being deleted. Thank God Kubernetes has such a beautiful design that gives careless people like me a second chance.😂

As I said earlier, the PV was now stuck in the Terminating status,

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM         STORAGECLASS          REASON   AGE
pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89   1Gi        RWO            Delete           Terminating   image/repo    managed-nfs-storage            29h

and I wanted to set it back to the Bound status. I thought that would be a piece of cake: I just needed to set the status of the PV to Bound in its YAML definition, right? No, that is not how finalizers work. Even though the PV appears to be in the Terminating status, its actual status is still Bound; the finalizer relies on the deletionTimestamp and deletionGracePeriodSeconds fields inside metadata to fulfill its purpose.

kind: PersistentVolume
apiVersion: v1
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: storage.io/nfs
  selfLink: '/api/v1/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89'
  deletionTimestamp: '2020-08-23T09:38:42Z'
  resourceVersion: '112535964'
  name: 'pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89'
  uid: 'd77ec8f7-710f-4005-b119-b7c1cda8d7e7'
  deletionGracePeriodSeconds: 0
  creationTimestamp: '2020-08-22T03:39:41Z'
  finalizers:
    - kubernetes.io/pv-protection
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  claimRef:
    kind: PersistentVolumeClaim
    namespace: image
    name: repo
    uid: 'eef4ec4b-326d-47e6-b11c-6474a5fd4d89'
    apiVersion: v1
    resourceVersion: '111102610'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-nfs-storage
  volumeMode: Filesystem
status:
  phase: Bound

The types.go source code clearly points out that:

Once the deletionTimestamp is set, this value may not be unset or be set further into the future, although it may be shortened or the resource may be deleted prior to this time.

This means I cannot set the status back to Bound in a normal (Kubernetes-supported) way. Kubernetes uses etcd as its data store, so I guessed I could delete the values of deletionTimestamp and deletionGracePeriodSeconds in etcd to forcibly reset the PV status back to Bound, and this answer supported my point of view.

The journey of etcd

Since I could not reset the PV status back to Bound using the kubectl client or by calling the Kubernetes API, I decided to update the PV's value in etcd directly. There is a great post about how Kubernetes uses etcd; please read it before continuing.

First, I needed to connect to etcd to get the value of the PV in the Terminating status. I used the etcd Go client, as it is what Kubernetes itself uses to interact with etcd. It uses PKI certificates to establish a secure connection; you can get the required etcd CA (ca.crt), public (etcd.crt), and private (etcd.key) certificates from the Kubernetes etcd node/pod, as explained in the post above.

The code to create an etcd client:

func etcdClient() (*clientv3.Client, error) {
  // Read the CA certificate used to verify the etcd server
  ca, err := ioutil.ReadFile(etcdCA)
  if err != nil {
    return nil, err
  }

  // Load the client certificate/key pair used to authenticate to etcd
  keyPair, err := tls.LoadX509KeyPair(etcdCert, etcdKey)
  if err != nil {
    return nil, err
  }

  certPool := x509.NewCertPool()
  certPool.AppendCertsFromPEM(ca)

  return clientv3.New(clientv3.Config{
    Endpoints:   []string{fmt.Sprintf("%s:%d", etcdHost, etcdPort)}, // e.g. localhost:2379
    DialTimeout: 2 * time.Second,
    TLS: &tls.Config{
      RootCAs:      certPool,
      Certificates: []tls.Certificate{keyPair},
    },
  })
}
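
The etcdCA, etcdCert, etcdKey, etcdHost, and etcdPort used above are package-level variables that the tool populates from command-line flags. As a minimal sketch (the values below are placeholders of mine, not the tool's real defaults), they might look like:

// Hypothetical declarations, for illustration only; the real tool reads
// these from command-line flags.
var (
  etcdCA   = "ca.crt"    // path to the etcd CA certificate
  etcdCert = "etcd.crt"  // path to the client public certificate
  etcdKey  = "etcd.key"  // path to the client private key
  etcdHost = "localhost" // etcd server host
  etcdPort = 2379        // etcd server port
)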

Then get the value of the PV:

key := "/registry/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89"
resp, err := client.Get(ctx, key)
  if err != nil {
    return err
}
fmt.Println(string(resp.Kvs[0].Value))

I forwarded the etcd port on the pod to localhost, so my client can use localhost:2379 to connect to the etcd server:

kubectl port-forward pods/etcd-member-master0 2379:2379 -n etcd

The output of the code shows the raw value of the PV in etcd:

k8s

v1PersistentVolume�
�
(pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89"*$d77ec8f7-710f-4005-b119-b7c1cda8d7e72������PZ

appimageb1
pv.kubernetes.io/provisioned-bystorage.io/nfsrubernetes.io/pv-protectionz�

storage
1GiS*Q

ReadWriteOnce"\ata/nfs1/image-repo-pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89
PersistentVolumeClaimimagerepo"$eef4ec4b-326d-47e6-b11c-6474a5fd4d89*v12       111102610:*Delete2managed-nfs-storageB
Filesystem

Bound"

Even though most of the content is human-readable, there are several mysterious '�' characters scattered around the output. I can see the PV's Version, Kind, and UID, its label, size, and some attributes, but where are the deletionTimestamp and deletionGracePeriodSeconds?

After some digging, I learned that Kubernetes has two serialization formats: JSON and Protobuf. You can verify this by looking at types.go: each struct field that needs to be serialized has both json and protobuf definitions in its tag, like `json:"kind,omitempty" protobuf:"bytes,1,opt,name=kind"`.
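
For instance, a trimmed excerpt of the TypeMeta struct from apimachinery shows the dual tags (comments mine):

type TypeMeta struct {
  // Kind is serialized as the JSON key "kind" and as protobuf field 1.
  Kind string `json:"kind,omitempty" protobuf:"bytes,1,opt,name=kind"`
  // APIVersion is serialized as the JSON key "apiVersion" and as protobuf field 2.
  APIVersion string `json:"apiVersion,omitempty" protobuf:"bytes,2,opt,name=apiVersion"`
}

JSON is the default serialization format for the API, so when we use kubectl to get resources, we get them back in JSON: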

kubectl get pod --v=9

I0824 10:43:27.478474   30383 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.18.8 (darwin/amd64) kubernetes/9f2892a" 'https://10.61.2.249:6443/api/v1/namespaces/default/pods?limit=500'
I0824 10:43:27.494898   30383 round_trippers.go:443] GET https://10.61.2.249:6443/api/v1/namespaces/default/pods?limit=500 200 OK in 16 milliseconds
I0824 10:43:27.494937   30383 round_trippers.go:449] Response Headers:
I0824 10:43:27.494955   30383 round_trippers.go:452]     Cache-Control: no-cache, private
I0824 10:43:27.494961   30383 round_trippers.go:452]     Content-Type: application/json
I0824 10:43:27.494966   30383 round_trippers.go:452]     Date: Mon, 24 Aug 2020 02:43:27 GMT
I0824 10:43:27.496179   30383 request.go:1068] Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1","metadata":{"selfLink":"/api/v1/namespaces/default/pods","resourceVersion":"812855"},"columnDefinitions":[{"name":"Name","type":"string","format":"name","description":"Name must be unique within a namespace. Is required when creating resources, although some resources may allow a client to request the generation of an appropriate name automatically. Name is primarily intended for creation idempotence and configuration definition. Cannot be updated. More info: http://kubernetes.io/docs/user-guide/identifiers#names","priority":0},{"name":"Ready","type":"string","format":"","description":"The aggregate readiness state of this pod for accepting traffic.","priority":0},{"name":"Status","type":"string","format":"","description":"The aggregate status of the containers in this pod.","priority":0},{"name":"Restarts","type":"integer","format":"","description":"The number of times the containers in this pod have been restarted.","priority":0},{"name":"Age","type":"string","format":"","description":"CreationTimestamp is a timestamp representing the server time when this object was created. It is not guaranteed to be set in happens-before order across separate operations. Clients may not set this value. It is represented in RFC3339 form and is in UTC.\n\nPopulated by the system. Read-only. Null for lists. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata","priority":0},{"name":"IP","type":"string","format":"","description":"IP address allocated to the pod. Routable at least within the cluster. Empty if not yet allocated.","priority":1},{"name":"Node","type":"string","format":"","description":"NodeName is a request to schedule this pod onto a specific node. If it is non-empty, the scheduler simply schedules this pod onto that node, assuming that it fits resource requirements.","priority":1},{"name":"Nominated Node","type":"string","format":"","description":"nominatedNodeName is set only when this pod preempts other pods on the node, but it cannot be scheduled right away as preemption victims receive their graceful termination periods. This field does not guarantee that the pod will be scheduled on this node. Scheduler may decide to place the pod elsewhere if other nodes become available sooner. Scheduler may also decide to give the resources on this node to a higher priority pod that is created after preemption. As a result, this field may be different than PodSpec.nodeName when the pod is scheduled.","priority":1},{"name":"Readiness Gates","type":"string","format":"","description":"If specified, all readiness gates will be evaluated for pod readiness. 
A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to \"True\" More info: https://git.k8s.io/enhancements/keps/sig-network/0007-pod-ready%2B%2B.md","priority":1}],"rows":[{"cells":["busybox","1/1","Running",72,"3d","10.200.1.5","k8s5","\u003cnone\u003e","\u003cnone\u003e"],"object":{"kind":"PartialObjectMetadata","apiVersion":"meta.k8s.io/v1","metadata":{"name":"busybox","namespace":"default","selfLink":"/api/v1/namespaces/default/pods/busybox","uid":"8f208315-c20c-45b8-900b-57d7c13678e4","resourceVersion":"811714","creationTimestamp":"2020-08-21T02:34:42Z","labels":{"run":"busybox"},"managedFields":[{"manager":"kubectl","operation":"Update","apiVersion":"v1","time":"2020-08-21T02:34:42Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:run":{}}},"f:spec":{"f:containers":{"k:{\"name\":\"busybox\"}":{".":{},"f:command":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2020-08-24T02:35:46Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.200.1.5\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}}]}}},{"cells":["nginx-f89759699-xcw4j","1/1","Running",0,"2d23h","10.200.0.6","k8s4","\u003cnone\u003e","\u003cnone\u003e"],"object":{"kind":"PartialObjectMetadata","apiVersion":"meta.k8s.io/v1","metadata":{"name":"nginx-f89759699-xcw4j","generateName":"nginx-f89759699-","namespace":"default","selfLink":"/api/v1/namespaces/default/pods/nginx-f89759699-xcw4j","uid":"662d1afd-4419-4224-b445-db28274a8029","resourceVersion":"263856","creationTimestamp":"2020-08-21T03:01:17Z","labels":{"app":"nginx","pod-template-hash":"f89759699"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"nginx-f89759699","uid":"a1760f8b-21a4-4681-aa86-705d52e28a7f","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"v1","time":"2020-08-21T13:02:20Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{},"f:labels":{".":{},"f:app":{},"f:pod-template-hash":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"a1760f8b-21a4-4681-aa86-705d52e28a7f\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{"f:containers":{"k:{\"name\":\"nginx\"}":{".":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2020-08-21T13:04:18Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:la
stTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.200.0.6\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}}]}}}]}

However, according to the API reference, we can also get the resource in Protobuf format by setting the Accept request header to application/vnd.kubernetes.protobuf:

curl -k -XGET -H "Authorization: Bearer ${TOKEN}" -H "Accept: application/vnd.kubernetes.protobuf" 'https://10.61.2.249:6443/api/v1/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89' --output -
k8s

v1PersistentVolume�
(pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89"B/api/v1/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89*$d77ec8f7-710f-4005-b119-b7c1cda8d7e72  112535964����PZ

appimageb1
pv.kubernetes.io/provisioned-bystorage.io/nfsr

storage
1GiS*Q

ReadWriteOnce"\ata/nfs1/image-repo-pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89
PersistentVolumeClaimimagerepo"$eef4ec4b-326d-47e6-b11c-6474a5fd4d89*v12  111102610:*Delete2managed-nfs-storageB
Filesystem

Bound"%

Looks familiar? This confirms that the value stored in etcd is in Protobuf format, and the k8s characters at the beginning of the value are the Kubernetes Protobuf magic-number prefix, which, according to the API doc, helps identify content on disk or in etcd as Protobuf. The schema for the above binary value is:

A four byte magic number prefix:
  Bytes 0-3: "k8s\x00" [0x6b, 0x38, 0x73, 0x00]

An encoded Protobuf message with the following IDL:
  message Unknown {
    // typeMeta should have the string values for "kind" and "apiVersion" as set on the JSON object
    optional TypeMeta typeMeta = 1;

    // raw will hold the complete serialized object in protobuf. See the protobuf definitions in the client libraries for a given kind.
    optional bytes raw = 2;

    // contentEncoding is encoding used for the raw data. Unspecified means no encoding.
    optional string contentEncoding = 3;

    // contentType is the serialization method used to serialize 'raw'. Unspecified means application/vnd.kubernetes.protobuf and is usually
    // omitted.
    optional string contentType = 4;
  }

  message TypeMeta {
    // apiVersion is the group/version for this type
    optional string apiVersion = 1;
    // kind is the name of the object schema. A protobuf definition should exist for this object.
    optional string kind = 2;
  }
... ...  
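
As a quick sanity check of my own (not part of the tool), the magic-number prefix can be detected on any raw value read from etcd:

// k8sProtoMagic is the four-byte prefix documented above: "k8s" followed by 0x00.
var k8sProtoMagic = []byte{0x6b, 0x38, 0x73, 0x00}

// isK8sProtobuf reports whether a raw etcd value looks like a
// Kubernetes Protobuf-encoded object.
func isK8sProtobuf(raw []byte) bool {
  return bytes.HasPrefix(raw, k8sProtoMagic)
}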

Protobuf provides better performance at scale, which is why Kubernetes uses it to store values in etcd. However, this was bad news for me, because I wanted to update the value directly. If it were in JSON format, I could just take the string value, remove the deletionTimestamp and deletionGracePeriodSeconds key-value pairs, and save it back. But I cannot simply manipulate the returned value, as it is in Protobuf (binary) format. Or maybe I can? The deletionTimestamp is a time.Time and deletionGracePeriodSeconds is an int64; they probably map to a few of those '�' characters in the binary value, and I could consult the Protobuf encoding spec to work out exactly where they are and change the bytes to remove them. I did not want to go down that path, as it is risky and costly: if I did anything careless again, it would make the situation even worse. For example (yes, I tried🤪):

kubectl get pv
Error from server (NotAcceptable): object *core.PersistentVolume does not implement the protobuf marshalling interface and cannot be encoded to a protobuf message

A better way is to decode (unmarshal) the binary value into a Go data type, change the required values, encode the updated value, and save it back to etcd.

The journey of the serializer

As I mentioned earlier, Kubernetes does not simply use Protobuf to encode its data types; it creates a customized wrapper to encode/decode Go data types to and from Protobuf (remember the four-byte magic number?). I figured I could reuse this wrapper to do my work. It should be related to the code that interacts with etcd, and since the API server is the only Kubernetes component that connects to etcd, I started searching for clues in the API server code. The store.go file looked promising, especially the runtime.Codec field inside type store struct. I guessed it might be used for data encoding/decoding, so let's dig deeper. The Codec turned out to be a Serializer interface composed of the Encoder and Decoder interfaces. Bingo. What I needed next was a Protobuf implementation of the Serializer, so let's keep digging. Looking at the directory structure, I saw folders named json and protobuf and a file named codec_factory.go: the json folder contains the JSON implementation of the Serializer interface, and the protobuf folder contains the Protobuf implementation I was looking for. Even better, codec_factory.go contains the code showing how to initialize the Protobuf Serializer. Now everything was sorted out; I just needed to write the code to complete the work. The logic is:

1. Get the PV value from etcd (in Protobuf format).
2. Decode the Protobuf value into a PV struct.
3. Set the PV status from Terminating back to Bound by removing the values of DeletionTimestamp and DeletionGracePeriodSeconds.
4. Encode the fixed PV struct back into a Protobuf value.
5. Write the updated Protobuf value back to etcd.

The corresponding code is straightforward:

func recoverPV(ctx context.Context, client *clientv3.Client) error {

  gvk := schema.GroupVersionKind{Group: v1.GroupName, Version: "v1", Kind: "PersistentVolume"}
  pv := &v1.PersistentVolume{}

  runtimeScheme := runtime.NewScheme()
  runtimeScheme.AddKnownTypeWithName(gvk, pv)
  protoSerializer := protobuf.NewSerializer(runtimeScheme, runtimeScheme)
  key := fmt.Sprintf("/%s/persistentvolumes/%s", k8sKeyPrefix, pvName)

  // Get value from etcd
  resp, err := client.Get(ctx, key)
  if err != nil {
    return err
  }

  if len(resp.Kvs) < 1 {
    return fmt.Errorf("cannot find persistent volume [%s] in etcd with key [%s]\nplease check the k8s-key-prefix and the persistent volume name are set correctly", pvName, key)
  }

  // Decode protobuf value to Go pv struct
  _, _, err = protoSerializer.Decode(resp.Kvs[0].Value, &gvk, pv)
  if err != nil {
    return err
  }

  if pv.ObjectMeta.DeletionTimestamp == nil {
    return fmt.Errorf("persistent volume [%s] is not in terminating status", pvName)
  }

  // Set PV status from terminating back to bound by clearing the two
  // metadata fields the finalizer relies on
  pv.ObjectMeta.DeletionTimestamp = nil
  pv.ObjectMeta.DeletionGracePeriodSeconds = nil

  var fixedPV bytes.Buffer
  // Encode fixed PV to protobuf value
  err = protoSerializer.Encode(pv, &fixedPV)
  if err != nil {
    return err
  }

  // Write the updated value back to etcd
  _, err = client.Put(ctx, key, fixedPV.String())
  return err
}
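
And here is a rough sketch of how the two functions could be wired together (this is my simplification; flag parsing for pvName, k8sKeyPrefix, and the certificate paths is omitted, see the actual tool for the full version):

func main() {
  client, err := etcdClient()
  if err != nil {
    log.Fatalln("failed to create etcd client:", err)
  }
  defer client.Close()

  // Bound all etcd operations with a timeout
  ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
  defer cancel()

  if err := recoverPV(ctx, client); err != nil {
    log.Fatalln(err)
  }
}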

One thing worth mentioning is the etcd key format: the community version of Kubernetes uses /registry as the etcd key prefix, so the key for a persistent volume named pv1 is /registry/persistentvolumes/pv1, while OpenShift uses /kubernetes.io as the prefix, making the key for pv1 /kubernetes.io/persistentvolumes/pv1.

My environment is an OpenShift cluster.
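
If you are not sure which prefix your cluster uses, a small sketch of mine (not part of the tool) lists the persistent volume keys under a given prefix with the same etcd client:

// listPVKeys prints every persistent volume key under the given etcd
// prefix, e.g. "/registry" or "/kubernetes.io".
func listPVKeys(ctx context.Context, client *clientv3.Client, prefix string) error {
  resp, err := client.Get(ctx, prefix+"/persistentvolumes/",
    clientv3.WithPrefix(), clientv3.WithKeysOnly())
  if err != nil {
    return err
  }
  for _, kv := range resp.Kvs {
    fmt.Println(string(kv.Key))
  }
  return nil
}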

The end

After running recoverPV(), the terminating persistent volume is back in the Bound status. Finally.

./resetpv --k8s-key-prefix kubernetes.io pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89

kubectl get pv

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM         STORAGECLASS          REASON   AGE
pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89   1Gi        RWO            Delete           Bound    image/repo    managed-nfs-storage            2d11h

Although I made a mistake, I did learn a lot by trying to fix it. The lessons learned: always double-confirm, and know how to restore the change, before you update or delete anything on an important system, unless you are absolutely sure of what you are doing.

The source code and how to use it can be found here.

Cheers.🍻

Tags: kubernetes, etcd, go, k8s, pv, protobuf