Service traffic handling in OpenShift 4.x, OVN and SDN selection algorithms and more

This article discusses the different balancing algorithms that are selected for traffic management in OpenShift 4 (at the time of writing, through OpenShift 4.15).

Table of contents

Enabling Session-affinity on a service
OVN-Kubernetes overview
OVN-Kubernetes balancing update
OpenShift-SDN overview

Enabling Session-affinity on a service of type: Loadbalancer

By default, session affinity (sticky sessions) is disabled for services, so requests are distributed effectively at random among the backend pods. Consecutive requests to a service may therefore land on different pods, which may lead to undesirable results.

However, factors such as HTTP Keep-Alive and session affinity play a role in how requests are distributed. More details below:

  • Keep-alive affects distribution because all requests sent over the same connection reach the same pod. The connection stays open as long as the idle time between requests does not exceed the keep-alive timeout.

  • Session affinity (or sticky sessions) affects distribution by directing requests to a particular pod. Enable it in the service's spec with spec.sessionAffinity: ClientIP (the default is None, i.e. disabled). This ensures that traffic is routed to the same backend pod each time, based on the IP address of the client making the request.

See more here: https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity and the OpenShift docs page here: https://docs.openshift.com/container-platform/4.13/rest_api/network_apis/service-v1.html
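As a minimal sketch of the setting described above, the following builds a Service manifest with session affinity enabled and prints it as JSON. The service name, selector, and ports are hypothetical placeholders; only `sessionAffinity` and `sessionAffinityConfig` are the fields under discussion.

```python
import json

# Hypothetical Service manifest: the name, selector, and ports are
# placeholders chosen for illustration, not taken from this article.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "example-service"},
    "spec": {
        "selector": {"app": "example-app"},
        "ports": [{"protocol": "TCP", "port": 8080, "targetPort": 8080}],
        # The default value is "None" (affinity disabled).
        "sessionAffinity": "ClientIP",
        # Optional: how long a client stays pinned to the same pod.
        # 10800 seconds (3 hours) is the Kubernetes default.
        "sessionAffinityConfig": {"clientIP": {"timeoutSeconds": 10800}},
    },
}

manifest_json = json.dumps(service, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and applied with `oc apply -f <file>`, assuming the selector matches the labels on your own pods.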

Complete explanation below:

An overview of OVN-Kubernetes and open-vswitch

In brief:

The short version is that OVN uses the OVS dp-hash algorithm to hash incoming traffic requests and match each one to an available bucket (pod). The net result is effectively a random selection of backends. However, due to the nature of the selection method, the selection is imperfect and can result in an uneven distribution of traffic among backends when routed via services (as opposed to a "smart" selection service like HAProxy, Nginx or Service Mesh).

In more detail:

With OVN-Kubernetes, the dp-hash hashing algorithm in open-vswitch is the default selection method for routing requests to different buckets (backends/pods), based on the incoming information being routed to the service. The hash distributes requests across the backends, which has a randomizing effect and is efficient for distribution. However, because of the nature of the data being hashed, the selections are not expected to always be equal or evenly distributed among the backends.

Default hash fields include source and destination Ethernet addresses, VLAN ID, source and destination IP addresses, and source and destination TCP/UDP ports.

Read more on this here - snippet below

+### Q: How does OVS divide flows among buckets in an OpenFlow "select" group?
+
+A: In Open vSwitch 2.3 and earlier, Open vSwitch used the destination
+   Ethernet address to choose a bucket in a select group.
+
+   Open vSwitch 2.4 and later by default hashes the source and
+   destination Ethernet address, VLAN ID, Ethernet type, IPv4/v6
+   source and destination address and protocol, and for TCP and SCTP
+   only, the source and destination ports.  The hash is "symmetric",
+   meaning that exchanging source and destination addresses does not
+   change the bucket selection.
+
+   Select groups in Open vSwitch 2.4 and later can be configured to
+   use a different hash function, using a Netronome extension to the
+   OpenFlow 1.5+ group_mod message.  For more information, see
+   Documentation/group-selection-method-property.txt in the Open
+   vSwitch source tree.  (OpenFlow 1.5 support in Open vSwitch is still
+   experimental.)
+

Hashing will never give a perfect split, but the distribution is affected by the total number of backends: a larger number of backends (more pods) may improve the distribution ratios. Read more on this here
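To illustrate why a hash-based split is never perfectly even, here is a toy simulation. It is a sketch under stated assumptions, not the OVS implementation: CRC32 stands in for the datapath hash, the flow tuples are synthetic, and `pick_backend` and `simulate` are hypothetical helper names.

```python
import random
import zlib
from collections import Counter


def pick_backend(flow, n_backends):
    """Map a flow tuple to a backend index by hashing it (CRC32 stand-in)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_backends


def simulate(n_flows, n_backends, seed=0):
    """Count how many synthetic flows land on each backend."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_flows):
        # Synthetic 5-tuple: only the client address and source port vary.
        flow = (
            f"10.128.0.{rng.randint(2, 254)}",  # source IP
            "172.30.0.10",                      # destination (service) IP
            "tcp",                              # protocol
            rng.randint(32768, 60999),          # ephemeral source port
            8080,                               # destination port
        )
        counts[pick_backend(flow, n_backends)] += 1
    return counts


counts = simulate(n_flows=1000, n_backends=3)
for backend in range(3):
    print(f"backend {backend}: {counts[backend]} flows")
```

Every flow is assigned to exactly one backend, but the per-backend totals differ from run to run of the input data; changing `n_flows` or `n_backends` shows how the imbalance shifts.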

What is dp-hash?

              dp_hash
                     Use a datapath computed hash value. The hash algorithm
                     varies across different datapath implementations.
                     dp_hash uses the upper 32 bits of the
                     selection_method_param as the datapath hash algorithm
                     selector. The supported values are 0 (corresponding to
                     hash computation over the IP 5-tuple) and 1
                     (corresponding to a symmetric hash computation over the
                     IP 5-tuple). Selecting specific fields with the fields
                     option is not supported with dp_hash). The lower 32 bits
                     are used as the hash basis.

                     Using dp_hash has the advantage that it does not require
                     the generated datapath flows to exact match any
                     additional packet header fields. For example, even if
                     multiple TCP connections thus hashed to different select
                     group buckets have different source port numbers,
                     generally all of them would be handled with a small set
                     of already established datapath flows, resulting in less
                     latency for TCP SYN packets. The downside is that the
                     shared datapath flows must match each packet twice, as
                     the datapath hash value calculation happens only when
                     needed, and a second match is required to match some
                     bits of its value. This double-matching incurs a small
                     additional latency cost for each packet, but this
                     latency is orders of magnitude less than the latency of
                     creating new datapath flows for new TCP connections.

For every session, dp-hash is translated in the kernel datapath (OVS DP flows) to whatever is returned by skb_get_hash(skb). For locally generated traffic (which is the case with OVN networked pods), this returns skb->hash, which is taken from the TCP sock structure itself and is set to a random value.

A new session that uses the same tuple might get a different skb->hash, which leads to a random distribution of traffic to backends.

When traffic is originating from inside the cluster (pod to service to pod) the session is assigned a random hash value every time, to avoid overlaps or repeated traffic routing to the same backend. Regardless of namespace, if the client starting the session is a pod then the hash will be random, which ensures no session affinity that might otherwise be present from an internal client.
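The behaviour described above can be modelled with a small toy example. This is only a sketch: the `Connection` class is hypothetical, the 32-bit random value stands in for the per-socket hash, and reducing it modulo the number of backends is an assumption made for simplicity rather than the actual OVS bucket lookup.

```python
import random
from collections import Counter


class Connection:
    """A client connection whose hash is drawn at random when it opens."""

    def __init__(self, rng):
        self.flow_hash = rng.getrandbits(32)

    def backend(self, n_backends):
        # Every packet on this connection carries the same hash,
        # so it always resolves to the same backend.
        return self.flow_hash % n_backends


rng = random.Random(42)
n_backends = 3

# One keep-alive connection carrying five requests: a single backend.
keepalive = Connection(rng)
print({keepalive.backend(n_backends) for _ in range(5)})

# Three hundred short-lived connections from the same client: each one
# gets a fresh hash, so the requests spread across the backends.
hits = Counter(Connection(rng).backend(n_backends) for _ in range(300))
print(sorted(hits.items()))
```

The first print shows a set with one element, which is the keep-alive effect; the second shows the same client reaching several backends once each request opens a new connection.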

See this Bug that discusses selection of dp-hashing as default, and this git page that talks about hash ordering.

See the highlight from this Medium article (snippet below), which discusses the benefits and issues of consistent hashing, which is what the dp-hash function is:

Then there’s consistent hashing. Consistent hashing uses a more elaborate scheme, where each server is assigned multiple hash values based on its name or ID, and each request is assigned to the server with the “nearest” hash value. The benefit of this added complexity is that when a server is added or removed, most requests will map to the same server that they did before.

Since the dp-hash function selects based on values that vary with each incoming request, the matched backend will vary from one selection to the next.

The ultimate takeaway here should be: using a service to route traffic in OVN-Kubernetes is expected to result in a random selection order across available backends, and some backends are going to be hit more frequently than others due to the nature of the hash that is in use. While OVN core does support changing the selection method from dp-hash to hash, changing the hash or selection method is not supported or available in current OpenShift releases; this option is not exposed in the version we ship.

If you are interested in a more even distribution of traffic to your backends, or a managed selection ordering option like round-robin, you will need to use an intelligent routing solution such as ingress or a reverse proxy (HAProxy, nginx, service mesh, etc.) to route traffic to these backends with more granularity and control.

OVN-Kubernetes Balancing Correction update:

  • An update to upstream OVS was introduced to try to address this balancing issue in the dp-hash selection order and can be reviewed here
  • This is tracked in the following RFE: https://issues.redhat.com/browse/RFE-4200
  • Below is a snippet from the update that helps illustrate the changes; more information is available in the commit above.

Approach taken in this change is to ensure that the hash space is
at least 4 times larger than the number of buckets, but not larger
than the maximum allowed (256).  This provides a better distribution
while not unnecessarily exploding number of datapath flows for
services with not that many backends.

Here is some data to demonstrate why the 4 was chosen as a coefficient:

  coeff.  :   1         2         3         4         5        100
  -------------------------------------------------------------------
  AvgDiff : 43.1 %    27.1 %    18.3 %    15.1 %    13.6 %    10.9 %
  MaxDiff : 50.0 %    33.3 %    25.0 %    20.0 %    20.0 %    20.0 %
  AvgDev  : 24.9 %    13.4 %     8.5 %     6.9 %     6.1 %     4.8 %
  MaxDev  : 35.4 %    20.4 %    14.4 %    11.2 %    11.2 %    11.2 %
  --------+----------------------------------------------------------
    16    :   1         1         1         1         1         -
    32    :  17         9         6         5         4         -
    64    :  33        17        11         9         7         -
   128    :  65        33        22        17        13         1
   256    : 129        65        43        33        26         2
  --------+----------------------------------------------------------
           current                       proposed

Table shows average and maximum load difference (Diff) between backends
across groups with 1 to 64 equally weighted backends.  And it shows
average and maximum standard deviation (Dev) of load distribution for
the same.  For example, with a coefficient 2, the maximum difference
between two backends will be 33% and the maximum standard deviation
will be 20.4%.  With the current logic (coefficient of 1) we have
maximum difference as high as 50%, as shown with the example at the
beginning, with the standard deviation of 35.4%.

The bottom half of the table shows from how many backends we start to
use a particular number of buckets.  For example, with a coeff. 3
we will have 16 hashes for 1 to 5 buckets, 32 hashes for 6-10, 64
buckets for 11-21 and so on.

According to the table, the number 4 is about where we achieve a good
enough standard deviation for the load (11.2%) while still not creating
too many hashes for cases with low number of backends.  The standard
deviation also doesn't go down that much with higher coefficient.
  • The above change to selection balancing has been introduced in the following versions and later, which should significantly improve traffic distribution on versions of OpenShift running OVN-Kubernetes moving forward:
      4.14.42
      4.15.39
      4.16.27
      4.17.9
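The sizing rule from the commit message quoted above can be sketched as follows. This is a model inferred from the quoted table, not the OVS source: the helper names are hypothetical, the minimum of 16 hashes and doubling in powers of two are read off the bottom half of the table, and the load difference is assumed to be (largest share - smallest share) / largest share with hashes split as evenly as possible across equally weighted backends.

```python
def hash_space(n_backends, coeff=4, minimum=16, maximum=256):
    """Smallest power-of-two hash space >= coeff * n_backends, clamped."""
    size = minimum
    while size < n_backends * coeff and size < maximum:
        size *= 2
    return size


def max_load_diff(n_backends, coeff):
    """Relative gap between the most and least loaded backend.

    Assumes equally weighted backends that each receive either
    floor(size / n) or ceil(size / n) hash values.
    """
    size = hash_space(n_backends, coeff)
    low = size // n_backends
    high = low + (1 if size % n_backends else 0)
    return (high - low) / high


for coeff in (1, 2, 4):
    worst = max(max_load_diff(n, coeff) for n in range(1, 65))
    print(f"coeff {coeff}: worst-case difference {worst:.1%}")
# coeff 1: worst-case difference 50.0%
# coeff 2: worst-case difference 33.3%
# coeff 4: worst-case difference 20.0%
```

Under these assumptions the model reproduces the MaxDiff row of the table for groups of 1 to 64 backends, as well as the thresholds in its bottom half (for example, with a coefficient of 3 the hash space grows from 16 to 32 at 6 backends).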

========================================================================================

An overview of OpenShift-SDN and iptables

In brief:

OpenShift-SDN relies on kube-proxy to program iptables rules that route traffic to the backends selected by a service using ClusterIP (the internal default for pod-to-pod traffic). iptables doesn't have a load-balancing strategy; it has a set of rules that can APPROXIMATE one, and in effect it uses statistical selection rules to pick the next pod in the list at random.

In more detail:

Services in OpenShift-SDN rely on kube-proxy to integrate with iptables and create rules that approximate load balancing. A service is just a redirect rule: a pointer that routes requests to the backends specified by its selector.

See this excellent article discussing how Kubernetes traffic is managed:

Does iptables use round-robin?

No, iptables is primarily used for firewalls, and it is not designed to do load balancing.

[However, you could craft a smart set of rules that could make iptables behave like a load balancer.](https://scalingo.com/blog/iptables#load-balancing)

And this is precisely what happens in Kubernetes.

If you have three Pods, kube-proxy writes the following rules:

  • Select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule.
  • Choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule.
  • Select Pod 3 as the destination (no probability).

The compound probability is that Pod 1, Pod 2 and Pod 3 all have a one-third chance (33%) to be selected.

[Figure: iptables rules for three Pods]

Also, there's no guarantee that Pod 2 is selected after Pod 1 as the destination.

The effect is that services using kube-proxy (OpenShift-SDN) handle requests with an effectively random selection method: each rule's probability is weighted to account for the rules evaluated before it, so every pod ends up equally likely to be chosen.
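The arithmetic behind the three-pod example can be checked with a short sketch. The function names are hypothetical, and the code models only the probabilities described in the quoted article, not the actual iptables rules kube-proxy writes.

```python
from fractions import Fraction


def rule_probabilities(n_pods):
    """Per-rule match probability: rule i fires with chance 1 / (pods left)."""
    return [Fraction(1, n_pods - i) for i in range(n_pods)]


def compound_probabilities(n_pods):
    """Overall chance of each pod, given that earlier rules did not match."""
    result = []
    not_matched_yet = Fraction(1)
    for p in rule_probabilities(n_pods):
        result.append(not_matched_yet * p)
        not_matched_yet *= 1 - p
    return result


print(rule_probabilities(3))      # rules: 1/3, 1/2, 1
print(compound_probabilities(3))  # every pod: 1/3
```

For Pod 2, for instance, the rule is only reached when Pod 1 was not chosen (2/3 of the time), and it then matches half the time, giving 2/3 x 1/2 = 1/3.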

It is important to note that this selection method cannot be changed; modifying these selection rules is not supported. If the objective is more granular control or a specific selection method, a more intelligent solution must be explored: HAProxy, Nginx, or a service mesh.

See additional documentation resources:

https://stackoverflow.com/questions/54865746/how-does-k8s-service-route-the-traffic-to-mulitiple-endpoints

https://stackoverflow.com/questions/49888133/kubernetes-service-cluster-ip-how-is-this-internally-load-balanced-across-diffe

https://kubernetes.io/docs/reference/networking/virtual-ips/#:~:text=By%20default%2C%20kube%2Dproxy%20in%20iptables%20mode%20chooses%20a%20backend%20at%20random.

https://kubernetes.io/docs/concepts/services-networking/service/#ips-and-vips
