Cyberlands.io - API Penetration Testing

DevOps Attacks a Managed Kubernetes Cluster, Part 2.

This is the second part of a guest story from Pavel Selivanov, an experienced DevOps engineer who has deployed Kubernetes clusters for real estate, online services and other industries. Find the first part here.

Intro

Today I will describe how to gather intel, attack, override defences and gain control of 98% of the Kubernetes clusters out there (or simply shut them down), and how to protect your infrastructure from this.


There is a problem with your Kubernetes cluster that you are most likely not aware of. And if you have monitoring installed in your cluster, I am willing to bet it is called Prometheus.

Abusing Prometheus monitoring

Everything below applies both to the Prometheus Operator and to Prometheus deployed in its plain form. The point is that if I cannot get admin access to the cluster quickly, I just need to keep searching. And I can search with the help of your monitoring.


Everyone has probably read the same articles online, so monitoring lives in the monitoring namespace. The Helm chart is named roughly the same for everyone: if you run helm install stable/prometheus, you end up with roughly the same resource names. Most likely, I won't even have to guess the DNS name in your cluster, because it is the default one.


Now, suppose we have some dev namespace in which you can launch a pod. From that pod you can run the following command:
$ curl http://prometheus-kube-state-metrics.monitoring

prometheus-kube-state-metrics is one of the Prometheus exporters that collects metrics from the Kubernetes API itself. It holds a lot of data about what is running in your cluster, which versions it runs, and what problems it has.


As a simple example:
kube_pod_container_info{namespace="kube-system",pod="kube-apiserver-k8s-1",
  container="kube-apiserver",
  image="gcr.io/google-containers/kube-apiserver:v1.14.5",
  image_id="docker-pullable://gcr.io/google-containers/kube-apiserver@sha256:e29561119a52adad9edc72bfe0e7fcab308501313b09bf99df4a9638ee634989",
  container_id="docker://7cbe7b1fea33f811fdd8f7e0e079191110268f2853397d7daf08e72c22d3cf8b"} 1


By making a simple curl request from an unprivileged pod, you can get information like this. If you do not know which version of Kubernetes you are running, this is how you can easily find out.
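To make this concrete, here is a rough sketch of how to pull the API server version out of the raw exporter output. The /metrics path is the usual kube-state-metrics endpoint; your chart may expose it on a different port or path, so treat this as an illustration:

$ curl -s http://prometheus-kube-state-metrics.monitoring/metrics \
    | grep 'container="kube-apiserver"'   # the image label carries the cluster version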


Hardening Kubernetes monitoring: Network Policy, Calico, RBAC-Proxy

The most exciting thing is that besides accessing kube-state-metrics, you can just as well contact Prometheus itself directly and collect metrics from there. You can even run queries against it, including a query from inside the cluster heavy enough to simply knock Prometheus over. And then your cluster monitoring stops working altogether.
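For instance, the same unprivileged pod can talk to the Prometheus HTTP API directly. A minimal sketch, assuming the service name prometheus-server that the standard chart creates in the monitoring namespace:

$ curl -s 'http://prometheus-server.monitoring/api/v1/query?query=up'
$ curl -s 'http://prometheus-server.monitoring/api/v1/label/__name__/values'   # list every metric name Prometheus knows about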


The question then arises whether any external monitoring watches your monitoring. I have just gained the ability to act in your Kubernetes cluster without any consequences for myself. You will not even know I am operating there, since there is no monitoring anymore.


Just as with PSP, it may feel like the problem is that all these fancy technologies (Kubernetes, Prometheus) simply don't work and are full of holes. Not really.


There is a control called Network Policy. When correctly configured, it prevents me from doing any of the things described above.


As in the example I have shown, you can pull kube-state-metrics from any namespace in the Kubernetes cluster without having any rights to do so. Network policies close access to the monitoring namespace from all other namespaces, and that's it: no access, no problem. In all the charts that exist (both the standard Prometheus chart and the Prometheus Operator), there is simply an option in the Helm values to enable network policies. You just need to turn it on, and these network policies will work.
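For reference, here is a minimal sketch of what such a policy can look like if you write it yourself instead of flipping the chart value: it allows ingress into the monitoring namespace only from pods in that same namespace.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
  namespace: monitoring
spec:
  podSelector: {}          # applies to every pod in the monitoring namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # only pods from this same namespace may connect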


There is just one little problem here. As a regular bearded admin, you probably decided long ago that network policies are unnecessary. And after reading all sorts of articles on resources like DevOps.com, you concluded that flannel, especially in host-gw mode, is the best you can choose. The catch is that flannel does not implement Network Policy at all, so the policies you enable are simply never enforced.


What can you do about this? You can try to redeploy the network plugin in your Kubernetes cluster and replace it with something more functional, Calico for example.


However, I want to say right away that changing the network plugin in a working Kubernetes cluster is a rather nontrivial task.


Actually, the problem tends to solve itself. The cluster has certificates, and you know they will expire in a year. There is even a well-known "solution" to expiring cluster certificates: instead of fussing with rotation, you raise a new cluster next to the old one, let the certificates in the old one expire, and redeploy everything to the new one. True, everything will be down for a day when they expire, but then you have a new cluster.


When you launch a new cluster, replace flannel with Calico at the same time.


What if your certificates are issued for a hundred years and you are not going to redeploy the cluster? There is a tool called kube-rbac-proxy. It is an excellent solution that can be embedded as a sidecar container into any pod in the Kubernetes cluster, adding authorisation to that pod through Kubernetes' own RBAC.


There is just one little problem. Previously, kube-rbac-proxy was built into the Prometheus Operator, but then it was removed. Modern versions rely on you having network policies and closing access with them. So you have to rewrite the chart a little. If you go to this repository, there are examples of using it as a sidecar, and the charts need only minimal rewriting.
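To give a feel for what that rewriting involves, here is a rough sketch of a kube-rbac-proxy sidecar in front of Prometheus. The image tags, ports and flags shown are illustrative assumptions, not the exact chart contents.

containers:
  - name: kube-rbac-proxy
    image: quay.io/brancz/kube-rbac-proxy:v0.14.2   # illustrative tag
    args:
      - --secure-listen-address=0.0.0.0:8443
      - --upstream=http://127.0.0.1:9090/            # Prometheus inside the same pod
    ports:
      - containerPort: 8443
        name: https
  - name: prometheus
    image: quay.io/prometheus/prometheus:v2.45.0     # illustrative tag
    args:
      - --web.listen-address=127.0.0.1:9090          # reachable only through the proxy

Requests then have to carry a ServiceAccount token, and the proxy checks them against Kubernetes RBAC before passing them on to Prometheus.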


There is one more small problem. Prometheus isn't the only component handing its metrics out to anyone who asks. All the other Kubernetes cluster components can expose their metrics as well.


But as I said, even if you cannot get any deeper into the cluster than collecting information, you can still do harm.


So I'll quickly show you two ways to ruin your Kubernetes cluster's performance. You may laugh when I tell you this, but these are two real-life cases.


Denial of Service Attack method #1: Exhaustion of resources

We launch another special pod. Its spec will contain a section like this:

resources:
  requests:
    cpu: 4
    memory: 4Gi


As you know, requests are the amount of CPU and memory that a node reserves for specific pods. If a node in the Kubernetes cluster has four cores, and a pod arrives requesting four CPUs, then no other pod with requests can be scheduled onto that node.
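Put together, the "special pod" might look roughly like this as a Deployment; the name and image are illustrative, the point is only that each replica reserves a whole node's worth of CPU.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: special-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: special-pod
  template:
    metadata:
      labels:
        app: special-pod
    spec:
      containers:
        - name: sleep
          image: registry.k8s.io/pause:3.9   # does nothing, just holds the reservation
          resources:
            requests:
              cpu: 4
              memory: 4Gi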


If I run this pod, I then issue a command like this:
$ kubectl scale deployment special-pod --replicas=...

No one else will be able to deploy to the Kubernetes cluster, because every node will have its allocatable resources exhausted by requests. This way, I will stop your Kubernetes cluster. If I do this in the evening, I can halt it for quite a while.


If we look at the Kubernetes documentation again, we will see a thing called LimitRange. It constrains resources for objects in the cluster. You can write a LimitRange object in YAML and apply it to a specific namespace; in that namespace you can then define default, maximum, and minimum resources for pods.
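A minimal sketch of such an object, with illustrative values: containers in the namespace get defaults and cannot request more than one CPU.

apiVersion: v1
kind: LimitRange
metadata:
  name: team-limits
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container specifies no requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container specifies no limits
        cpu: 500m
        memory: 512Mi
      max:                 # hard ceiling per container
        cpu: "1"
        memory: 1Gi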


With such an object, we can stop users in the teams' product namespaces from specifying nasty things on their pods. But unfortunately, even if you tell a user that they cannot run pods requesting more than one CPU, there is still the wonderful scale command (or the dashboard), with which they can spin up as many replicas as they like.

Denial of Service Attack method #2: Crashing the cluster with pod deployments

Here's a story. It was late in the evening, and I was about to leave the office. I saw a group of developers sitting in a corner, frantically doing something with their laptops. I went up to the guys and asked: "What happened?"


A little earlier, at nine o'clock in the evening, one of the developers had been getting ready to go home. He decided: "I'm going to scale my application down to one instance." He pressed the "1" key, and the Internet froze a little. He smashed that "1" again, and again, and then hit Enter. He poked everything he could. Then the Internet came back to life, and everything began to scale up to 11 111 111 111 111 instances.


True, this story did not take place on Kubernetes; at that time it was Nomad. It ended with the fact that after an hour of our attempts to stop its persistent scaling, Nomad declared that it would not stop scaling and would not do anything else. And then it crashed.


Naturally, I tried to do the same on Kubernetes. Scaling to eleven billion pods did not please Kubernetes; it replied, in effect: "I can't, that exceeds internal limits." But 1,000,000,000 pods it could do. In response to one billion, Kubernetes did not crash. It started to scale. The further the process went, the more time it took to create new pods, but the process kept going.


If I can create pods in my namespace without limit, then even without requests and limits I can launch a number of pods whose workloads start exhausting the nodes' memory and CPU. And when I launch that many pods, information about them has to go into the store, that is, etcd. When too much information arrives there, the store starts responding slowly, and Kubernetes starts to slow down.


And that is the problem: the Kubernetes control plane is not one central thing but several components, in particular the controller manager, the scheduler, and so on. All of them will start doing useless work simultaneously, which takes more and more time as things go on. The controller manager will keep creating new pods. The scheduler will try to find nodes for them. You will most likely run out of nodes soon. The Kubernetes cluster will run slower and slower.


But I decided to go even further. As you know, Kubernetes has a concept called a Service. By default, Services in your cluster are implemented with iptables rules on the nodes. If you run, say, one billion pods and then use a script to force Kubernetes to create new Services:
for i in {1..1111111}; do
  kubectl expose deployment test --port 80 \
    --overrides="{\"apiVersion\": \"v1\", \"metadata\": {\"name\": \"nginx$i\"}}"
done



then new iptables rules will be generated on the cluster nodes almost without pause. Moreover, rules are created for every pod behind a Service, so for each such Service roughly one billion iptables rules will be generated.


I checked this with several thousand pods, up to about ten thousand. Already at that threshold, getting SSH access to a node is quite problematic, because packets that have to traverse that many rules start performing rather poorly.
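If you want to watch this happen, you can count the NAT rules kube-proxy has programmed on a node. A quick sketch, assuming kube-proxy runs in iptables mode and you have root on the node:

$ sudo iptables-save -t nat | grep -c '^-A KUBE-'   # grows with every Service and endpoint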


This, too, is solved with the help of Kubernetes. There is a ResourceQuota object that sets the amount of resources and the number of objects available to a namespace in the cluster. We can create such a YAML object in each namespace of the Kubernetes cluster. Using it, we can say that a certain amount of requests and limits is allocated to this namespace, and that, say, ten Services and ten pods may be created in it. As a result, in a properly configured environment a developer can press 1 for hours without doing any harm: Kubernetes will tell him, "You cannot scale your pods to that number, because the resource quota is exceeded." That's it, the problem is solved. Documentation is here.
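A sketch of such a quota with illustrative numbers: it caps the total requests and limits in the namespace as well as the number of pods and Services that may exist there.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "10"
    services: "10"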


One problematic point emerges here. You can now feel how involved creating a namespace in Kubernetes becomes: to set one up properly, we need to take care of a whole list of things.


Resource quota + Limit Range + RBAC
  • Create a namespace
  • Create a LimitRange inside
  • Create ResourceQuota inside
  • Create a ServiceAccount for CI
  • Create a RoleBinding for CI and users (the CI pieces are sketched after this list)
  • Optionally launch the required service pods
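The CI-related items can be sketched roughly like this; the names, the namespace and the choice of the built-in edit role are illustrative assumptions.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gitlab-ci
  namespace: team-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-ci-edit
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit               # built-in role, scoped here to this one namespace
subjects:
  - kind: ServiceAccount
    name: gitlab-ci
    namespace: team-a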


Therefore, taking this opportunity, I would like to share something we built. There is a tool called the Operator SDK, which lets you write operators for a Kubernetes cluster using Ansible.


At first we had all of this written as an Ansible role; then I looked at what the Operator SDK offers and rewrote the Ansible role into an operator. This operator lets you create an object in the Kubernetes cluster called a team. Within a team object, you describe that team's environment in YAML, and within the team's environment you state that this many resources are allocated and no more. You can find it here.
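To give an idea of what such an object can look like, here is a hypothetical custom resource of that kind; the API group, kind and fields below are illustrative and are not the author's actual CRD.

apiVersion: example.com/v1alpha1
kind: Team
metadata:
  name: payments
spec:
  namespaces:            # namespaces the operator creates and configures
    - payments-dev
    - payments-stage
  quota:                  # becomes a ResourceQuota in each namespace
    requests.cpu: "10"
    requests.memory: 20Gi
  limitRange:
    maxCpuPerPod: "1"
  ciServiceAccount: gitlab-ci   # ServiceAccount plus RoleBinding for CI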

Conclusions

First, Pod Security Policy is an effective security control. Even though none of the Kubernetes installers enables PSP by default, you still need to use it in your clusters.


Network Policy is also of paramount importance.


LimitRange and ResourceQuota: it's time for you to start using them. We started using them a long time ago, and I was certain they were the kind of feature that everyone and their dog had been using for ages. It turned out to be quite a rarity.


Some other things are also worth mentioning. For example, under certain conditions kubelets in a Kubernetes cluster can hand directory contents to an unauthorised user, and an extensive analysis of Kubernetes vulnerabilities has been released.


There are scripts you can use if you want to reproduce everything I have described, along with files containing production examples of what ResourceQuota and Pod Security Policy look like. Feel free to play with them.

You can also read Part 1 - privilege escalation in Kubernetes.

Kubernetes Cluster Security Checklist - 12 practices

1. Apply mutual authentication of cluster components using certificates.

2. Ensure secrets rotation.

3. Enable audit logs, configure parsing and corresponding alerts.

4. Implement a cluster update mechanism supporting frequent updates of the cluster itself and its components.

5. Configure firewall for cluster service ports.

6. Enable and configure network access policies.

7. Implement RBAC authorisation including control of RBAC roles.

8. Apply authorisation of external users to the cluster through the company's central user store (e.g. LDAP) with roles, etc., plus some form of OAuth (e.g. Dex).

9. Enable and fine-tune Pod Security Policies.

10. Enable and configure LimitRange and ResourceQuota for all namespaces.

11. Implement separate clusters for Dev / Stage and Production.

12. Ensure that only the CI tool has access to launch new abstractions, with strict control of the specific rights and namespaces for each CI project.


This should be enough to stop most attackers. It certainly adds a lot of work for the DevOps / SRE teams, but that is precisely one of the core security principles: security is paid for with usability.


Pavel Selivanov
Mail.Ru Cloud Solutions, Senior DevOps Engineer