MOVING TO KUBERNETES - ADVICE AND MORE
Changing server architecture is always a big step for any project. If your company has arrived at this point, there is most likely a good reason for it, such as better stability, unified environments, or quick autoscaling. Kubernetes best fits microservice architectures. As a matter of fact, the more distinct the cluster entities are, the better. This allows:
- Setting precise limits to each service
- Establishing only necessary connections
- Choosing the most suitable Kubernetes entity type for each service (Deployment, ReplicationController, DaemonSet, etc.)
Storing Persistent Data
When developing a project's architecture, we try to completely avoid storing data in files. Why?
- Most platforms (AWS, Google) only allow a block-level storage volume to be mounted from a single point at a time. This limits horizontal scaling of containers that use a Persistent Volume.
- When a file system contains a lot of files, accessing it becomes slow, which significantly impedes the general responsiveness of the resource.
Ways to avoid it:
- We store static content in Object Storage. On Amazon, we use S3. On a hardware cluster, we use Ceph RBD for persistent storage and the Ceph RADOS Gateway for S3-compatible storage.
- We try to store most of the data in databases and/or NoSQL storage (such as ElasticSearch).
- We store sessions and cache in in-memory databases (Redis/Memcached).
Nevertheless, if it is Persistent Volume that is required, it should be prepared properly.
- First, collect the list of ALL the directories that store persistent data. If you fail to do so, the data will be written without any errors, but after a container restart or migration to a different node, ALL data will be LOST.
- Try to arrange the directories so that all your data is stored under a single directory, which makes it possible to use only one Persistent Volume per container. This rule is not always applicable, and sometimes it is simply necessary to distribute data among several PVs. Only the application's architect, who knows the purpose of the persistent storage and the volume of data intended to be stored there, can give the ultimate answer.
- Select a suitable file system. Ext4 is good for most tasks, but sometimes choosing a more suitable file system can benefit performance.
- Select an optimal PV size. Don't worry, you will be able to extend it easily if necessary. However, if a file system is overloaded, a resize will take even more resources and can affect performance.
When all requirements are met, create a YAML file for the Kubernetes Persistent Volume. In the case of AWS it may look like this:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-my-pv
  annotations:
    volume.beta.kubernetes.io/storage-class: "default"
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0dc1fcf80ac20300a
    fsType: ext4
```
Note that in the example above we deliberately set the volumeID, so that the Kubernetes PV is bound to a specific AWS Elastic Block Store volume. We also set volume.beta.kubernetes.io/storage-class, so that the same AWS EBS volume can be used when the PV is created again. This matters if you want to be bound to a specific EBS volume. Kubernetes uses dynamic EBS creation by default. If you don't set the volumeID and volume.beta.kubernetes.io/storage-class, Kubernetes will create a new EBS volume when creating a PV. When you delete such a PV in Kubernetes, the Amazon EBS volume will be removed as well.
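As a rough illustration of the dynamic creation mentioned above, a StorageClass might look like the sketch below. The name example-ebs, the gp2 volume type and the Retain policy are assumptions made for illustration, not settings from our clusters. It also uses the in-tree kubernetes.io/aws-ebs provisioner, while newer clusters typically use the EBS CSI driver instead.

```yaml
# Hypothetical StorageClass for dynamically created EBS volumes (a sketch only).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-ebs                  # hypothetical name
provisioner: kubernetes.io/aws-ebs   # in-tree AWS EBS provisioner
parameters:
  type: gp2                          # assumed EBS volume type
  fsType: ext4                       # file system created on the volume
reclaimPolicy: Retain                # keep the EBS volume when the PV is deleted
allowVolumeExpansion: true           # allow a PVC to be extended later
```

With reclaimPolicy set to Retain, deleting the PV should leave the EBS volume in place. The default for dynamically created volumes is Delete.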
Now you need PVC (Persistent Volume Claim), which you will mount into containers.
Once again, in the case of Amazon, its code may look like this:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-my-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: "default"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```
You will use the name example-my-pvc to mount the volume into containers.
Now let's talk about the proper way to launch a container. We are going to discuss this topic in more detail later in this article, but leaping ahead I would like to say that for containers with a PV, one needs to use a Deployment or a ReplicationController. As a rule, these entities presuppose scaling, but, as has already been said, in the case of a PV and a platform like AWS, horizontal scaling is out of the question. Why do we need a replication controller then? Because if a container fails (if, say, a worker node shuts down), the container will be restored automatically, and everything will continue running as before. Obviously, you can't scale such an RC or Deployment.
A controller which is using PV may look like this:
```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: example-my-controller
  labels:
    name: example-my-controller
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: example-my-controller
    spec:
      containers:
        - name: example-my-controller
          image: example-my-controller:latest
          volumeMounts:
            - mountPath: /srv/data
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: example-my-pvc
```
No big secret. We simply mount our PVC to /srv/data/ with the default read-write access.
Shipping a Container into a Network
Before getting down to explanations, I would like to describe how a network is set up in Kubernetes. A Controller and a Service are two independent entities in Kubernetes. A Controller may have listening ports. However, to access them inside a cluster, using a Kubernetes Service is recommended.
A Kubernetes Service announces a static IP address and a DNS name in the Kubernetes network. With the help of selectors, Kubernetes connects service ports with controller ports. Thus, by referring to the service address, the software connects to the controllers' open ports. If you are using several replicas (as in the case of a Deployment or ReplicationController), the Kubernetes Service balances the traffic between them, even when the containers are on different cluster nodes. When you scale, Kubernetes automatically includes the new replicas in load balancing. It's extremely convenient!
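To make the selector mechanism more concrete, here is a minimal sketch of a Service. The name example-my-service and both port numbers are assumptions. The selector reuses the name: example-my-controller label from the controller example earlier in this article.

```yaml
# A sketch of an in-cluster Service; the name and ports are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: example-my-service          # also becomes the DNS name inside the cluster
spec:
  selector:
    name: example-my-controller     # matches the labels on the controller's pods
  ports:
    - name: http
      port: 80                      # port announced by the Service
      targetPort: 8080              # assumed port the container listens on
```

Other pods in the same namespace could then reach the application at example-my-service:80, regardless of which nodes the replicas run on.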
We could talk how network functions in Kubernetes for many more pages, but within this article I would like to highlight the most important points:
- Open only those ports that really need to be accessed. You can always adjust the Kubernetes Service later.
- Avoid using HostPort. Why? Firstly, ports may conflict on the host if several replicas of a controller run on the same instance. Secondly, it doesn't look pretty.
- Use a load balancer at the infrastructure level (for example, an AWS internal load balancer) for critical services. No matter what they say, it's more stable. On the downside, it costs a bit more money.
- If you need to expose a service to the outside world, use type: LoadBalancer.
- Think twice before using Ingress. In most cases, a Kubernetes Service with type: LoadBalancer is enough. You can also configure SSL on it. Ingress is useful only when you have lots of domains that you want to put behind a single load balancer and/or you need to automatically issue and renew Let's Encrypt SSL certificates.
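For the type: LoadBalancer case from the list above, a Service could be sketched as follows. The name, the label and the port numbers are hypothetical. SSL and internal-balancer settings are platform-specific annotations and are deliberately left out.

```yaml
# A sketch of a Service exposed through a cloud load balancer.
apiVersion: v1
kind: Service
metadata:
  name: example-my-public-service   # hypothetical name
spec:
  type: LoadBalancer                # asks the cloud platform to provision a load balancer
  selector:
    name: example-my-controller     # assumed pod label
  ports:
    - name: https
      port: 443                     # port exposed on the load balancer
      targetPort: 8443              # assumed port the container listens on
```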
Everybody knows how important logs are. When dealing with container architecture, there is one requirement: write your logs to stdout/stderr.
This lets you use the standard Kubernetes logging interface and integrate fully with log tooling.
You can view logs in Kubernetes in several ways:
- kubectl logs - the easiest and clearest way, similar to docker logs. The command simply outputs the logs of a specific container (Pod).
- stern (https://github.com/wercker/stern) - a more advanced variation of the previous command. It does the same thing as kubectl logs, but outputs the logs of an entire group of containers at once, selected by application name. It is very convenient when you have more than one container.
- fluentd + ElasticSearch + Kibana - log aggregation in NoSQL storage, with access through a web interface, search and filtering.
- Obviously, no one keeps you from going to the server and running docker logs, but we recommend avoiding such low-level operations and staying at a higher level if possible.
- If you want only the most recent logs, use the -f and --tail flags.
Example: kubectl logs -f --tail=10 <pod-id>
This keeps a long log history from flooding the terminal and shows updates in real time. By default, stern behaves as if you had already added the -f flag.
- If a pod won't start, the logs are not always the place to look. If a container failed to start at all, it's worth checking kubectl describe pods <pod-id>.
- In any case, try not to rely on container logs alone, and additionally send important events to Sentry or a similar aggregator.
- If you are planning to use fluentd and a log aggregator, write one-line logs, because multi-line logs are difficult to analyze in aggregators. If you use ES as your storage, write your logs in JSON.
The first rule of the Dysnix team is: replicate everything that requires high availability.
We recommend configuring at least three replicas for all services that listen on the network, even in single-zone clusters. Why? If a single node fails, users shouldn't find out about it.
The second rule of the Dysnix team is: set explicit resource limits (CPU, memory) for all services you configure. What does this have to do with replication? Everything. The CPU values you set (the CPU request, in particular) become the basis for decisions on horizontal pod autoscaling, which ties them directly to replication. That was our smooth transition to the topic of limits.
If the upper and lower resource limits are not set explicitly, an application risks being scheduled onto a node that lacks resources from the start or, even worse, consuming all of the server's resources and destabilizing the node. That is why we strongly recommend setting the limits explicitly.
If you don't know how much resource your application consumes, set up a test environment without any limits and run some stress tests. Performance graphs will show the consumption limits within which the application functions without failure. Add 10-30% to those performance limits and set them as strict ones. Here's an example of controller description section with top and bottom resource limits:
```yaml
resources:
  requests:
    cpu: 0.1
    memory: 128Mi
  limits:
    cpu: 1
    memory: 1024Mi
```
Here we state that the application needs at least one tenth of a core and 128 MiB of RAM, and that the upper limit is one full core and 1 GiB of RAM.
Keep in mind that autoscaling as described here depends on CPU only. Memory leaks or incorrectly set limits won't trigger autoscaling; they will only get the container killed by the OOM killer.
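A CPU-based autoscaling rule could be sketched as below. The target name example-my-app, the replica bounds and the 70% threshold are assumptions chosen for illustration. The utilization percentage is measured against the CPU request of the pods, which is one more reason to set requests explicitly.

```yaml
# A sketch of CPU-based horizontal pod autoscaling; all values are hypothetical.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-my-app                # assumed Deployment to scale
  minReplicas: 3                        # in line with the three-replica rule
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70    # average CPU usage relative to the request
```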
Selecting a Controller
There are several ways to launch a container in a Kubernetes cluster. Here are the most popular ones, available now in a stable Kubernetes version:
- Pod
- Job
- Deployment
- Replication Controller
- Replica Set
- DaemonSet
- StatefulSet
- Cron Job
'Why so many?!' one might think.
Actually, when using Kubernetes in different projects it might seem they are too few.
Let's start with a simple one - Pod (https://kubernetes.io/docs/concepts/workloads/pods/pod/).
In its simplest form, a Pod is a single container which can be launched in a cluster in the same manner you would launch a container via docker run. You describe it in a YAML file, create it in the cluster, and there you go, a pod has been launched. You can view its logs or log in to it with the help of kubectl exec. However, if the pod fails or is deleted by someone else, your service won't come back by itself.
It's difficult for me to name a realistic use for a standalone Pod in production. Usually I consider it to be a constituent of other entities.
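For reference, a standalone Pod description might look like this minimal sketch. The pod name and the nginx image are placeholders chosen for illustration.

```yaml
# A sketch of a standalone Pod; the name and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: example-my-pod
  labels:
    name: example-my-pod
spec:
  containers:
    - name: example-my-pod
      image: nginx:1.25        # placeholder image
      ports:
        - containerPort: 80    # port the container listens on
```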
Job - a rarely used but still useful tool. It is similar to Pod in the sense that it can be launched once and it will stop after the process stops. The difference is that after a Job stops, its logs and execution status remain recorded. It's convenient when you need to manually run a command within a cluster and save the details of its execution in the history.
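A one-off Job could be sketched as follows. The name, the busybox image and the command are placeholders.

```yaml
# A sketch of a one-off Job; the name, image and command are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-my-job
spec:
  backoffLimit: 0                  # do not retry if the command fails
  template:
    spec:
      restartPolicy: Never         # a Job requires Never or OnFailure
      containers:
        - name: example-my-job
          image: busybox:1.36      # placeholder image
          command: ["sh", "-c", "echo running a one-off task"]
```

After the command finishes, the Job object and its pod stay in the cluster, so the status and logs remain available until the Job is deleted.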
The remaining entity types belong to the Kubernetes Controllers group.
Deployment - suits most applications. It differs dramatically from Pods and Jobs: when the pods that belong to a Deployment stop, fail or are deleted, they are restarted immediately. It's convenient to roll out a Deployment fast, scale it and even configure autoscaling. The only drawback might be connection breakups during the rollout.
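A Deployment that follows the replication and limits advice from this article might be sketched like this. The name and image are placeholders. The strategy block is an assumption added for illustration: it asks Kubernetes to replace pods gradually, which may reduce the connection breakups mentioned above.

```yaml
# A sketch of a Deployment; the name, image and strategy values are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-my-app
spec:
  replicas: 3                      # three replicas minimum, as recommended above
  selector:
    matchLabels:
      name: example-my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never take an old pod down before a new one is ready
      maxSurge: 1                  # add at most one extra pod during the rollout
  template:
    metadata:
      labels:
        name: example-my-app
    spec:
      containers:
        - name: example-my-app
          image: example-my-app:latest
          resources:
            requests:
              cpu: 100m            # one tenth of a core
              memory: 128Mi
            limits:
              cpu: "1"
              memory: 1024Mi
```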
Replication Controller - an excellent choice for HTTP services and other applications for which downtime during a rollout is critical. An RC can perform a gradual, smooth rollout with the help of the rolling update operation. This makes it possible to roll out the code without breaking client connections. I believe that is its main difference from Deployment.
Nowadays there is a new generation entity named Replica Set.
DaemonSet - a sly entity that makes it possible to place a container on each worker node. We use DS for deploying monitoring systems, as well as for some peculiar software. We also know of cases where a front-end web server is deployed with the help of DS.
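A DaemonSet for a per-node monitoring agent could be sketched as below. The name and the image are hypothetical.

```yaml
# A sketch of a DaemonSet running one agent pod per worker node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-monitor
spec:
  selector:
    matchLabels:
      name: example-node-monitor
  template:
    metadata:
      labels:
        name: example-node-monitor
    spec:
      containers:
        - name: example-node-monitor
          image: example-node-monitor:latest   # hypothetical agent image
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
```

Note that there is no replicas field: the number of pods follows the number of nodes.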
StatefulSet - an even slyer entity that enables assigning unique predictable hostnames, having more stable Persistent Storage, and performing graceful deployment, scaling and removal. You can't do without StatefulSet in a High Availability RabbitMQ cluster. Other than that, we rarely use StatefulSet.
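The predictable hostnames and per-pod storage can be illustrated with the sketch below. The names, the image and the 10Gi size are assumptions, and a headless Service named example-my-stateful is assumed to exist already.

```yaml
# A sketch of a StatefulSet; the names, image and sizes are hypothetical.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-my-stateful
spec:
  serviceName: example-my-stateful   # headless Service assumed to exist
  replicas: 3
  selector:
    matchLabels:
      name: example-my-stateful
  template:
    metadata:
      labels:
        name: example-my-stateful
    spec:
      containers:
        - name: example-my-stateful
          image: example-my-stateful:latest
          volumeMounts:
            - name: data
              mountPath: /srv/data
  volumeClaimTemplates:              # each pod gets its own PVC from this template
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
```

The pods would be named example-my-stateful-0, example-my-stateful-1 and example-my-stateful-2, and each keeps its own volume across restarts.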
Cron Job - the name speaks for itself. When you need to run a command periodically within a cluster, you can use a Cron Job. Where this entity is not available in Kubernetes, we use a separately launched container with a regular cron.
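A periodic task could be sketched as follows. The name, schedule, image and command are placeholders, and the batch/v1 API version assumes a reasonably recent cluster.

```yaml
# A sketch of a Cron Job that runs once a night; all values are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-my-cron
spec:
  schedule: "0 3 * * *"              # every day at 03:00
  concurrencyPolicy: Forbid          # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: example-my-cron
              image: busybox:1.36    # placeholder image
              command: ["sh", "-c", "echo nightly task"]
```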
I hope I helped you choose a cluster entity type and avoid falling into traps which we fell into while discovering the Kubernetes world.
We realize that these are not all the subtleties that can be described and foreseen in such a short manual. That is why I must mention that our team always pays attention to the peculiarities of each project when choosing the technologies and approaches which suit it best.
Daniel Yavorovych, Dysnix Co-Founder