Networking
There are a lot of moving parts in Kubernetes networking. There are both Kubernetes-specific components such as kube-proxy and CoreDNS, and cloud-specific components such as the underlying VNET/VPC, load balancers, and NAT gateways. This page aims to describe the architecture and some of the limitations and problems that can be experienced when working with XKS, to help with debugging networking issues.
Kubernetes
TBD
Node Local DNS
To lower DNS query latency, among other reasons, we use Node Local DNS in XKS.
Node Local DNS is an application that runs on each node, where it creates a local network interface together with a number of iptables rules. The iptables rules intercept all DNS traffic that pods send to the cluster's DNS server.
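If you want to check that Node Local DNS is actually running on the nodes, something like the sketch below can be used. The k8s-app=node-local-dns label and the nodelocaldns interface name follow the upstream defaults and are assumptions here; XKS may label things differently.
# List the Node Local DNS pods, one per node (label assumed from the upstream manifests)
kubectl get pods -n kube-system -l k8s-app=node-local-dns -o wide
# On a node, inspect the local interface created by Node Local DNS (interface name assumed)
ip addr show nodelocaldns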
Node Local DNS Configuration
To configure Node Local DNS you need to provide two values.
The first value is the IP of the central DNS server in your cluster. You can find it by running: kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}.
The second value is an IP that you know nothing else in the cluster is ever going to use; in our case we used the example IP 169.254.20.10.
These values are defined for you in XKS, but it's good to know about them and where to find them.
Here you can view the example configuration provided in the docs.
Node Local DNS is built on top of CoreDNS and is plugin based. Depending on your needs you can easily enable new features. By default Node Local DNS doesn't log the DNS requests it receives, which can make debugging hard. In XKS we haven't enabled any debug logging either, but if you need to enable it all you have to do is add log to the plugins defined in your yaml.
For example:
data:
  Corefile: |
    .:53 {
        errors
        log
        cache 30
        reload
        loop
        bind 169.254.20.10 10.0.0.10
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
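The plugins live in the Node Local DNS ConfigMap. In a stock upstream installation that ConfigMap is called node-local-dns in kube-system; the name is an assumption here since XKS manages the configuration for you.
# Edit the Node Local DNS ConfigMap (name and namespace assumed from the upstream manifests)
kubectl edit configmap node-local-dns -n kube-system
Since the reload plugin is enabled, the running instances should pick up the change without a restart.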
For you as an XKS administrator, the plugin you are most likely to want to change is the cache plugin. Instead of trying to rewrite the docs here I recommend reading them, but we have changed the default values, and at the time of writing we use the following config:
data:
  Corefile: |
    .:53 {
        log
        errors
        cache {
            success 9984 30
            denial 9984 10
            prefetch 20 60s 15%
        }
        reload
        loop
        bind 169.254.20.10 10.0.0.10
        forward . /etc/resolv.conf
        prometheus :9253
    }
The prefetch feature automatically refreshes DNS entries that are in the cache before their DNS TTL ends. Remember that the cache TTL won't change the TTL of your cached DNS entries: if a DNS entry has a TTL of 1 minute and the cache has a TTL of 5 minutes, the DNS entry will still disappear after 1 minute.
If you, for example, define a cache block without setting success and denial but with a prefetch config, the default cache TTL values will still be applied.
data:
  Corefile: |
    .:53 {
        log
        errors
        cache {
            prefetch 20 60s 15%
        }
        reload
        loop
        bind 169.254.20.10 10.0.0.10
        forward . /etc/resolv.conf
        prometheus :9253
    }
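To verify that queries are actually being answered by the node-local cache, one option is to query the link-local IP directly from a debug pod. A rough sketch, where the image is just any image that contains dig and is an assumption here:
kubectl run -it --rm dns-debug --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  -- dig @169.254.20.10 kubernetes.default.svc.cluster.local
With the log plugin enabled you should also see the query in the Node Local DNS pod logs on that node.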
Node Local DNS networkpolicy
Sadly, when using Node Local DNS together with NetworkPolicy and the Calico CNI, you can't use label selectors at the pod level; instead you need to write a rule that works at the node level. What the docs don't say is that you need to allow the internal VNET DNS IP as well.
These are the same values that were defined when doing the configuration.
The default values on XKS AKS are 169.254.20.10 and 10.0.0.10, and on AWS they are 169.254.20.10 and 172.20.0.10.
The needed networkpolicy exists by default in all tenant namespaces, is called default-deny, and is managed by Terraform.
To view the rule run:
kubectl get networkpolicies default-deny -n <tenant>
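The actual rule is managed by Terraform, but as a rough sketch, an egress rule working at the node/IP level could look something like this on AKS. The name and namespace are placeholders, and the IPs are the default AKS values mentioned above.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-node-local-dns
  namespace: <tenant>
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    # Allow DNS to the node-local cache IP and the internal VNET DNS IP
    - to:
        - ipBlock:
            cidr: 169.254.20.10/32
        - ipBlock:
            cidr: 10.0.0.10/32
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53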
Azure
XKS in Azure uses a single VNET with a single subnet per AKS cluster. The VNET and subnets are created by the core module. Additionally, each AKS cluster creates a load balancer, which is used for both ingress and egress traffic.
When a Kubernetes service of type LoadBalancer is created, a new IP is attached to the load balancer. An Azure load balancer can have multiple IPs attached to it, so unlike AWS it does not have to create a new load balancer for every service.
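As an illustration, creating a minimal service like the sketch below (names and ports are placeholders) results in a new frontend IP on the existing Azure load balancer rather than a completely new load balancer:
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080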
During the creation of the AKS cluster a public IP prefix is attached to the load balancer for egress traffic. This ensures that all traffic egresses with the same source IPs, enabling the use of IP whitelisting in external systems. It does however mean that all outbound traffic goes through the same load balancer as the incoming traffic. There is work underway to enable the use of managed NAT gateways for egress traffic in AKS, but it is currently in preview.
SNAT Exhaustion
Applications making large numbers of outgoing TCP or UDP connections to the same IP and port can cause an issue known as SNAT port exhaustion. This is mostly due to the network architecture in Azure and AKS. All outgoing traffic from AKS goes through the load balancer, and for each outgoing request the load balancer needs to allocate an SNAT port to receive the response. Each Azure load balancer allocates 64000 SNAT ports. This may seem like a lot, but there is a caveat: AKS limits the number of SNAT ports per node, and the number of ports available per node depends on the number of nodes in the cluster.
| Node Count | SNAT Ports per Node |
| --- | --- |
| 1-50 | 1024 |
| 51-100 | 512 |
| 101-200 | 256 |
| 201-400 | 128 |
| 401-800 | 64 |
| 801-1000 | 32 |
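For example, growing a cluster from 100 to 101 nodes halves the allocation from 512 to 256 SNAT ports per node, so scaling out can by itself push an application closer to exhaustion.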
A symptom of exhausting the SNAT ports is that outgoing requests simply fail. This is of course not a good situation, and it may be hard to debug as a failing request can be caused by many different factors.
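One way to get an indication of SNAT pressure is to look at the load balancer's Azure Monitor metrics, as described in the links below. A rough sketch using the Azure CLI, where the resource ID is a placeholder and the metric name is an assumption based on the Standard Load Balancer metrics:
az monitor metrics list \
  --resource <load-balancer-resource-id> \
  --metric SnatConnectionCount \
  --interval PT1M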
Links
- https://docs.microsoft.com/en-us/azure/aks/load-balancer-standard#troubleshooting-snat
- https://docs.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection
- https://www.danielstechblog.io/detecting-snat-port-exhaustion-on-azure-kubernetes-service/
- https://medium.com/asos-techblog/an-aks-performance-journey-part-1-sizing-everything-up-ee6d2346ea99
- https://medium.com/asos-techblog/an-aks-performance-journey-part-2-networking-it-out-e253f5bb4f69
AWS
TBD