TL;DR
When you delete a Kubernetes persistent volume by accident, it may get stuck in the Terminating status because the kubernetes.io/pv-protection
finalizer prevents it from being deleted. You can use this k8s-reset-terminating-pv tool to reset its status back to Bound.
Prologue
You may notice this is the first post of my blog.🥳 Actually, I had wanted to start a blog for a very, very long time, but I never found anything worth sharing until one day I deleted a very important persistent volume by accident.😱 The PV backed the Docker image repository holding hundreds of images used by several Kubernetes clusters, and its reclaim policy was Delete. I can still vividly recall the moment I realized what I had just done. After a while, though, the PV was still in the Terminating status. My first thought was that the PV was so big that deleting it simply took time; then I realized it was the finalizer that protected the PV from being deleted. Thank God Kubernetes has such a beautiful design that gives careless people like me a second chance.😂
As I said earlier, the PV is now stuck in the Terminating status,
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89 1Gi RWO Delete Terminating image/repo managed-nfs-storage 29h
and I want to set it back to the Bound status. I thought that would be a piece of cake: I just need to set the status of the PV to Bound in its YAML definition, right? No, that is not how a finalizer works. Even though the PV appears to be in the Terminating status, its actual status is still Bound; the finalizer relies on the deletionTimestamp and deletionGracePeriodSeconds fields inside metadata to fulfill its purpose.
kind: PersistentVolume
apiVersion: v1
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: storage.io/nfs
  selfLink: '/api/v1/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89'
  deletionTimestamp: '2020-08-23T09:38:42Z'
  resourceVersion: '112535964'
  name: 'pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89'
  uid: 'd77ec8f7-710f-4005-b119-b7c1cda8d7e7'
  deletionGracePeriodSeconds: 0
  creationTimestamp: '2020-08-22T03:39:41Z'
  finalizers:
    - kubernetes.io/pv-protection
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  claimRef:
    kind: PersistentVolumeClaim
    namespace: image
    name: repo
    uid: 'eef4ec4b-326d-47e6-b11c-6474a5fd4d89'
    apiVersion: v1
    resourceVersion: '111102610'
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-nfs-storage
  volumeMode: Filesystem
status:
  phase: Bound
The types.go source code clearly points out that:
Once the deletionTimestamp is set, this value may not be unset or be set further into the future, although it may be shortened or the resource may be deleted prior to this time.
This means I cannot set the status back to Bound in a normal (Kubernetes-supported) way. Kubernetes uses etcd as its data store, so I guessed I could delete the deletionTimestamp
and deletionGracePeriodSeconds
values in etcd to forcibly reset the PV status back to Bound, and this answer supported my point of view.
The journey of etcd
Since I cannot reset the PV status back to Bound using the kubectl client or by calling the Kubernetes API, I decided to update the PV's value in etcd directly. There is a great post about how Kubernetes uses etcd; please read that article before continuing.
First, I need to connect to etcd to get the value of the PV in the Terminating status. I use the etcd Go client, as it is what Kubernetes itself uses to interact with etcd. It uses PKI certificates to establish a secure connection; you can get the required etcd CA (ca.crt), public (etcd.crt), and private (etcd.key) certificates from the Kubernetes etcd node/pod, as explained in the post above.
The code to create an etcd client:
func etcdClient() (*clientv3.Client, error) {
    ca, err := ioutil.ReadFile(etcdCA)
    if err != nil {
        return nil, err
    }
    keyPair, err := tls.LoadX509KeyPair(etcdCert, etcdKey)
    if err != nil {
        return nil, err
    }
    certPool := x509.NewCertPool()
    certPool.AppendCertsFromPEM(ca)
    return clientv3.New(clientv3.Config{
        Endpoints:   []string{fmt.Sprintf("%s:%d", etcdHost, etcdPort)}, // localhost:2379
        DialTimeout: 2 * time.Second,
        TLS: &tls.Config{
            RootCAs:      certPool,
            Certificates: []tls.Certificate{keyPair},
        },
    })
}
Then get the value of the PV:
key := "/registry/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89"
resp, err := client.Get(ctx, key)
if err != nil {
    return err
}
fmt.Println(string(resp.Kvs[0].Value))
I forwarded the etcd port on the pod to localhost, so my client can use localhost:2379 to connect to the etcd server.
kubectl port-forward pods/etcd-member-master0 2379:2379 -n etcd
The output of the code shows the raw
value of PV in etcd:
k8s
v1PersistentVolume�
�
(pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89"*$d77ec8f7-710f-4005-b119-b7c1cda8d7e72������PZ
appimageb1
pv.kubernetes.io/provisioned-bystorage.io/nfsrubernetes.io/pv-protectionz�
storage
1GiS*Q
ReadWriteOnce"\ata/nfs1/image-repo-pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89
PersistentVolumeClaimimagerepo"$eef4ec4b-326d-47e6-b11c-6474a5fd4d89*v12 111102610:*Delete2managed-nfs-storageB
Filesystem
Bound"
Even though most of the content is human-readable, there are several mysterious '�' characters scattered around the output. I can see the PV's version, kind, and UID, its label, size, and some attributes, but where are deletionTimestamp and deletionGracePeriodSeconds?
After some digging, I learned that Kubernetes has two serialization formats: JSON and Protobuf. You can verify this by looking at types.go; each struct field that needs to be serialized has json and protobuf definitions in its tag, like `json:"kind,omitempty" protobuf:"bytes,1,opt,name=kind"`. JSON is the default serialization format for the API, so when we use kubectl to get resources, we get them in JSON format.
kubectl get pod --v=9
I0824 10:43:27.478474 30383 round_trippers.go:423] curl -k -v -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: kubectl/v1.18.8 (darwin/amd64) kubernetes/9f2892a" 'https://10.61.2.249:6443/api/v1/namespaces/default/pods?limit=500'
I0824 10:43:27.494898 30383 round_trippers.go:443] GET https://10.61.2.249:6443/api/v1/namespaces/default/pods?limit=500 200 OK in 16 milliseconds
I0824 10:43:27.494937 30383 round_trippers.go:449] Response Headers:
I0824 10:43:27.494955 30383 round_trippers.go:452] Cache-Control: no-cache, private
I0824 10:43:27.494961 30383 round_trippers.go:452] Content-Type: application/json
I0824 10:43:27.494966 30383 round_trippers.go:452] Date: Mon, 24 Aug 2020 02:43:27 GMT
I0824 10:43:27.496179 30383 request.go:1068] Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1","metadata":{"selfLink":"/api/v1/namespaces/default/pods","resourceVersion":"812855"},"columnDefinitions":[{"name":"Name","type":"string","format":"name","description":"Name must be unique within a namespace. Is required when creating resources, although some resources may allow a client to request the generation of an appropriate name automatically. Name is primarily intended for creation idempotence and configuration definition. Cannot be updated. More info: http://kubernetes.io/docs/user-guide/identifiers#names","priority":0},{"name":"Ready","type":"string","format":"","description":"The aggregate readiness state of this pod for accepting traffic.","priority":0},{"name":"Status","type":"string","format":"","description":"The aggregate status of the containers in this pod.","priority":0},{"name":"Restarts","type":"integer","format":"","description":"The number of times the containers in this pod have been restarted.","priority":0},{"name":"Age","type":"string","format":"","description":"CreationTimestamp is a timestamp representing the server time when this object was created. It is not guaranteed to be set in happens-before order across separate operations. Clients may not set this value. It is represented in RFC3339 form and is in UTC.\n\nPopulated by the system. Read-only. Null for lists. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata","priority":0},{"name":"IP","type":"string","format":"","description":"IP address allocated to the pod. Routable at least within the cluster. Empty if not yet allocated.","priority":1},{"name":"Node","type":"string","format":"","description":"NodeName is a request to schedule this pod onto a specific node. 
If it is non-empty, the scheduler simply schedules this pod onto that node, assuming that it fits resource requirements.","priority":1},{"name":"Nominated Node","type":"string","format":"","description":"nominatedNodeName is set only when this pod preempts other pods on the node, but it cannot be scheduled right away as preemption victims receive their graceful termination periods. This field does not guarantee that the pod will be scheduled on this node. Scheduler may decide to place the pod elsewhere if other nodes become available sooner. Scheduler may also decide to give the resources on this node to a higher priority pod that is created after preemption. As a result, this field may be different than PodSpec.nodeName when the pod is scheduled.","priority":1},{"name":"Readiness Gates","type":"string","format":"","description":"If specified, all readiness gates will be evaluated for pod readiness. A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to \"True\" More info: 
https://git.k8s.io/enhancements/keps/sig-network/0007-pod-ready%2B%2B.md","priority":1}],"rows":[{"cells":["busybox","1/1","Running",72,"3d","10.200.1.5","k8s5","\u003cnone\u003e","\u003cnone\u003e"],"object":{"kind":"PartialObjectMetadata","apiVersion":"meta.k8s.io/v1","metadata":{"name":"busybox","namespace":"default","selfLink":"/api/v1/namespaces/default/pods/busybox","uid":"8f208315-c20c-45b8-900b-57d7c13678e4","resourceVersion":"811714","creationTimestamp":"2020-08-21T02:34:42Z","labels":{"run":"busybox"},"managedFields":[{"manager":"kubectl","operation":"Update","apiVersion":"v1","time":"2020-08-21T02:34:42Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:run":{}}},"f:spec":{"f:containers":{"k:{\"name\":\"busybox\"}":{".":{},"f:command":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2020-08-24T02:35:46Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.200.1.5\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}}]}}},{"cells":["nginx-f89759699-xcw4j","1/1","Running",0,"2d23h","10.200.0.6","k8s4","\u003cnone\u003e","\u003cnone\u003e"],"object":{"kind":"PartialObjectMetadata","apiVersion":"meta.k8s.io/v1","metadata":{"name":"nginx-f89759699-xcw4j","generateName":"nginx-f89759699-","namespace":"default","sel
fLink":"/api/v1/namespaces/default/pods/nginx-f89759699-xcw4j","uid":"662d1afd-4419-4224-b445-db28274a8029","resourceVersion":"263856","creationTimestamp":"2020-08-21T03:01:17Z","labels":{"app":"nginx","pod-template-hash":"f89759699"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"nginx-f89759699","uid":"a1760f8b-21a4-4681-aa86-705d52e28a7f","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"v1","time":"2020-08-21T13:02:20Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{},"f:labels":{".":{},"f:app":{},"f:pod-template-hash":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"a1760f8b-21a4-4681-aa86-705d52e28a7f\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{"f:containers":{"k:{\"name\":\"nginx\"}":{".":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2020-08-21T13:04:18Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.200.0.6\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}}]}}}]}
However, according to the API reference, we can also get the resource in Protobuf format by setting the Accept
request header to application/vnd.kubernetes.protobuf:
curl -k -XGET -H "Authorization: Bearer ${TOKEN}" -H "Accept: application/vnd.kubernetes.protobuf" 'https://10.61.2.249:6443/api/v1/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89' --output -
k8s
v1PersistentVolume�
�
(pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89"B/api/v1/persistentvolumes/pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89*$d77ec8f7-710f-4005-b119-b7c1cda8d7e72 112535964����PZ
appimageb1
pv.kubernetes.io/provisioned-bystorage.io/nfsr
storage
1GiS*Q
ReadWriteOnce"\ata/nfs1/image-repo-pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89
PersistentVolumeClaimimagerepo"$eef4ec4b-326d-47e6-b11c-6474a5fd4d89*v12 111102610:*Delete2managed-nfs-storageB
Filesystem
Bound"%
Looks familiar? This confirms that the value stored in etcd is in Protobuf format, and the k8s
characters at the beginning of the value are the Kubernetes Protobuf magic-number prefix, which, according to the API doc, helps identify content on disk or in etcd as Protobuf. The schema for the above binary value is:
A four byte magic number prefix:
Bytes 0-3: "k8s\x00" [0x6b, 0x38, 0x73, 0x00]
An encoded Protobuf message with the following IDL:
message Unknown {
  // typeMeta should have the string values for "kind" and "apiVersion" as set on the JSON object
  optional TypeMeta typeMeta = 1;

  // raw will hold the complete serialized object in protobuf. See the protobuf definitions in the client libraries for a given kind.
  optional bytes raw = 2;

  // contentEncoding is encoding used for the raw data. Unspecified means no encoding.
  optional string contentEncoding = 3;

  // contentType is the serialization method used to serialize 'raw'. Unspecified means application/vnd.kubernetes.protobuf and is usually
  // omitted.
  optional string contentType = 4;
}

message TypeMeta {
  // apiVersion is the group/version for this type
  optional string apiVersion = 1;

  // kind is the name of the object schema. A protobuf definition should exist for this object.
  optional string kind = 2;
}
... ...
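Before doing anything with such a payload, it is worth sanity-checking that prefix. Here is a stdlib-only Go sketch; stripMagic is my own helper (not a Kubernetes API) that peels the four-byte magic number off a raw etcd value:

```go
package main

import (
	"bytes"
	"fmt"
)

// protobufMagic is the 4-byte prefix Kubernetes writes before every
// Protobuf-encoded value stored in etcd: "k8s" followed by a NUL byte.
var protobufMagic = []byte{0x6b, 0x38, 0x73, 0x00}

// stripMagic returns the encoded Unknown message without the prefix,
// or false if the value is not in the Kubernetes Protobuf envelope.
func stripMagic(raw []byte) ([]byte, bool) {
	if !bytes.HasPrefix(raw, protobufMagic) {
		return nil, false
	}
	return raw[len(protobufMagic):], true
}

func main() {
	raw := append(append([]byte{}, protobufMagic...), []byte("encoded Unknown message")...)
	body, ok := stripMagic(raw)
	fmt.Println(ok, string(body)) // true encoded Unknown message
}
```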
Protobuf provides better performance at scale, which is why Kubernetes uses this format to store values in etcd. However, this is bad news for me, because I want to update the value directly. If it were in JSON format, I could just get the string value, remove the deletionTimestamp
and deletionGracePeriodSeconds
key-value pairs from the string, and save it back. But I cannot simply manipulate the returned value, as it is in Protobuf (binary) format. Or maybe I can? The deletionTimestamp
is a time.Time
and deletionGracePeriodSeconds
is an int64
; they probably map to a few of those '�' characters in the binary value, so I could refer to the Protobuf encoding spec to figure out exactly where they are and change the bytes to remove them. I do not want to go down that path, as it is risky and costly: if I do anything careless again, it will make the situation even worse. For example (yes, I tried🤪):
kubectl get pv
Error from server (NotAcceptable): object *core.PersistentVolume does not implement the protobuf marshalling interface and cannot be encoded to a protobuf message
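To see why byte surgery is so fiddly, recall how the Protobuf wire format works: every field starts with a varint key, key = (field_number << 3) | wire_type, and locating a field means scanning keys while correctly skipping every preceding field's payload. A tiny sketch (fieldKey is my own helper; the field numbers assume ObjectMeta's generated.proto, where I believe deletionTimestamp is field 9 and deletionGracePeriodSeconds field 10, both length-delimited or varint respectively in the version I checked):

```go
package main

import "fmt"

// fieldKey computes the Protobuf wire-format key byte for small field
// numbers: (field_number << 3) | wire_type. One miscounted byte while
// hand-editing and the whole message is corrupted.
func fieldKey(fieldNumber, wireType int) byte {
	return byte(fieldNumber<<3 | wireType)
}

func main() {
	// deletionTimestamp: assumed field 9, length-delimited (wire type 2).
	fmt.Printf("0x%02x\n", fieldKey(9, 2)) // 0x4a
	// deletionGracePeriodSeconds: assumed field 10, varint (wire type 0).
	fmt.Printf("0x%02x\n", fieldKey(10, 0)) // 0x50
}
```

These are the bytes I would have to hunt for in the binary blob, which is exactly the kind of careless-mistake trap I wanted to avoid.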
A better way is to decode(unmarshal) the binary value into Go data type, change the required values, encode the updated value, then save it back to etcd.
The journey of serializer
As I mentioned earlier, Kubernetes does not simply use Protobuf to encode its data types; it wraps the encoding in a customized envelope (remember the four-byte magic number?). I thought I could reuse this wrapper to do my work. It should live near the code that interacts with etcd, and since the API server is the only Kubernetes component that connects to etcd, I started searching for clues in the API server code. The store.go file looks promising, especially the runtime.Codec
field inside type store struct
. I guessed it might be used for data encoding/decoding, so let's dig deeper. The Codec turns out to be a Serializer interface composed of the Encoder and Decoder interfaces, bingo. Now what I need is a Protobuf implementation of the Serializer, so let's keep digging. Looking at the directory structure, I see folders named json
and protobuf
and a file named codec_factory.go
; the json folder contains the JSON implementation of the Serializer interface, and the protobuf folder contains the Protobuf implementation I was looking for. Even better, codec_factory.go shows how to initialize the Protobuf Serializer. Now everything is sorted out, and the logic is simple: get the PV's value from etcd, decode it into a Go struct, clear the deletionTimestamp and deletionGracePeriodSeconds fields, encode the fixed struct back to Protobuf, and write it back to etcd.
The corresponding code is straightforward:
func recoverPV(ctx context.Context, client *clientv3.Client) error {
    gvk := schema.GroupVersionKind{Group: v1.GroupName, Version: "v1", Kind: "PersistentVolume"}
    pv := &v1.PersistentVolume{}
    runtimeScheme := runtime.NewScheme()
    runtimeScheme.AddKnownTypeWithName(gvk, pv)
    protoSerializer := protobuf.NewSerializer(runtimeScheme, runtimeScheme)
    key := fmt.Sprintf("/%s/persistentvolumes/%s", k8sKeyPrefix, pvName)
    // Get value from etcd
    resp, err := client.Get(ctx, key)
    if err != nil {
        return err
    }
    if len(resp.Kvs) < 1 {
        return fmt.Errorf("cannot find persistent volume [%s] in etcd with key [%s]\nplease check the k8s-key-prefix and the persistent volume name are set correctly", pvName, key)
    }
    // Decode protobuf value to Go pv struct
    _, _, err = protoSerializer.Decode(resp.Kvs[0].Value, &gvk, pv)
    if err != nil {
        return err
    }
    if pv.ObjectMeta.DeletionTimestamp == nil {
        return fmt.Errorf("persistent volume [%s] is not in terminating status", pvName)
    }
    // Set PV status from terminating back to bound
    pv.ObjectMeta.DeletionTimestamp = nil
    pv.ObjectMeta.DeletionGracePeriodSeconds = nil
    var fixedPV bytes.Buffer
    // Encode fixed PV to protobuf value
    err = protoSerializer.Encode(pv, &fixedPV)
    if err != nil {
        return err
    }
    // Write the updated value back to etcd
    _, err = client.Put(ctx, key, fixedPV.String())
    return err
}
One thing worth mentioning is the etcd key format: the community version of Kubernetes uses /registry
as the etcd key prefix, so the key for persistent volume pv1 is /registry/persistentvolumes/pv1
. OpenShift uses /kubernetes.io
as the prefix, so the key for pv1 is /kubernetes.io/persistentvolumes/pv1
.
My environment is an OpenShift cluster.
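The prefix handling boils down to one line; here is a small sketch where etcdKey is a hypothetical helper mirroring the fmt.Sprintf call in recoverPV:

```go
package main

import "fmt"

// etcdKey builds the etcd key for a persistent volume. keyPrefix is
// "registry" on community Kubernetes and "kubernetes.io" on OpenShift.
func etcdKey(keyPrefix, pvName string) string {
	return fmt.Sprintf("/%s/persistentvolumes/%s", keyPrefix, pvName)
}

func main() {
	fmt.Println(etcdKey("registry", "pv1"))      // /registry/persistentvolumes/pv1
	fmt.Println(etcdKey("kubernetes.io", "pv1")) // /kubernetes.io/persistentvolumes/pv1
}
```

Passing the wrong prefix is exactly the "cannot find persistent volume in etcd" error path in recoverPV, which is why the tool exposes it as the --k8s-key-prefix flag.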
The end
After running recoverPV()
, the terminating persistent volume is back in the Bound status. Finally.
./resetpv --k8s-key-prefix kubernetes.io pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-eef4ec4b-326d-47e6-b11c-6474a5fd4d89 1Gi RWO Delete Bound image/repo managed-nfs-storage 2d11h
Although I made a mistake, I learned a lot by trying to fix it. The lesson: always double-check, and know how to roll back a change, before you update or delete anything on an important system, unless you are absolutely sure of what you are doing.
The source code and how to use it can be found here.
Cheers.🍻