Cilium L2 Announcements with Gateway API

The Cilium L2 Announcements feature provides active-passive failover for external IPs and LoadBalancer IPs. Although the official documentation mentions load balancing, it also makes clear that there is currently no traffic load balancing between nodes: all traffic for an announced IP enters the cluster through a single node. If multiple service endpoints exist within the Kubernetes cluster, the traffic is still distributed across those endpoints once it is inside the cluster.

The feature provides a virtual IP address. At any given time, one node listens for ARP (IPv4) or NDP (IPv6) requests and responds to them. If this node goes down, another node will start responding to these requests and effectively take over the IP address. The feature is based on the Kubernetes lease mechanism.
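
To see which node currently holds a given IP, you can look at the corresponding lease. On my cluster, the leases live in the kube-system namespace and are prefixed with cilium-l2announce; the exact naming may differ between Cilium versions.

# List the L2 announcement leases and their current holder nodes
kubectl get leases -n kube-system | grep cilium-l2announce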

Since neither ARP nor NDP requests cross L3 boundaries, this feature only works within the same L2 domain. In most cases, a Layer 2 domain corresponds to a single subnet and VLAN.[1]

This guide provides further insights into how to implement the L2 Announcements feature and how to use it with the Gateway API.

I did not have success with this setup on Cilium v1.19: the Gateway never received an IP address. This tutorial is therefore based on Cilium 1.20.0-pre.1. Please provide feedback if you have better results with stable patch releases.

To use the L2 Announcements feature together with the Gateway API, you need to enable kube-proxy replacement. Replacing kube-proxy in an existing cluster will cause downtime. It is also a significant architectural change, so it is recommended either to rebuild the cluster or to plan for extended downtime with thorough post-deployment testing. In any case, testing in a staging environment is strongly recommended.

Cilium is not a lightweight CNI. However, enabling kube-proxy replacement reduces resource usage (kube-proxy no longer runs) and improves performance, because iptables rules are replaced with eBPF: especially in larger environments, lookups are performed in near-constant time, whereas iptables rules are evaluated sequentially. Additionally, observability is improved through Hubble.

Cilium uses Helm internally for deployment. Even when using the Cilium CLI, Helm is used under the hood. Therefore, this guide uses a values.yaml file for configuration.

Prerequisites

Kube-proxy replacement

Install Kubernetes without Kube-proxy (kubeadm)

You can skip the kube-proxy deployment during the kubeadm cluster initialization by adding the following parameter:

--skip-phases=addon/kube-proxy

Remove Kube-proxy from an existing Kubernetes cluster

# Remove the kube-proxy DaemonSet
kubectl delete ds -n kube-system kube-proxy
# Remove the kube-proxy ConfigMap
kubectl delete cm -n kube-system kube-proxy
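
Removing the DaemonSet does not remove the iptables rules kube-proxy has already programmed on the nodes. As far as I know, the Cilium documentation suggests flushing them on each node; a minimal sketch (run on every node, at your own risk):

# Drop kube-proxy's KUBE-* rules from the node, keep everything else
sudo iptables-save | grep -v KUBE | sudo iptables-restore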

Installing Gateway API CRDs

Before enabling Gateway API support in Cilium, the required CRDs must be installed. This guide uses Gateway API version v1.4.1, which is supported by Cilium 1.20. If you are using a different version of Cilium, check the official documentation for compatibility.

BASEURL=https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.4.1/config/crd
while read CRD; do
  kubectl apply -f ${BASEURL}/${CRD}
done <<EOF
standard/gateway.networking.k8s.io_gatewayclasses.yaml
standard/gateway.networking.k8s.io_gateways.yaml
standard/gateway.networking.k8s.io_grpcroutes.yaml
standard/gateway.networking.k8s.io_httproutes.yaml
standard/gateway.networking.k8s.io_referencegrants.yaml
standard/gateway.networking.k8s.io_backendtlspolicies.yaml
experimental/gateway.networking.k8s.io_tlsroutes.yaml
EOF
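
To confirm that the CRDs were registered, you can list them:

# The Gateway API CRDs should appear here
kubectl get crd | grep gateway.networking.k8s.io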

Installing Cilium CLI

The latest release of the Cilium CLI can be found here: https://github.com/cilium/cilium-cli/releases/latest/

For Linux (amd64) you can run:

curl -sL https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz | sudo tar xzf - -C /usr/local/bin/ cilium
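
A quick sanity check that the CLI is installed and on your PATH:

# Prints the CLI version; the server part may be missing until Cilium is installed
cilium version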

Installing Cilium

Even if you plan to customize your setup further, the key values for this guide are:

# Enable kube-proxy replacement
kubeProxyReplacement: true
k8sServiceHost: controlplane  # replace with your control plane endpoint
k8sServicePort: 6443

# Enable Gateway API
gatewayAPI:
  enabled: true

# Enable L2 announcements
l2announcements:
  enabled: true

Without kube-proxy, Cilium must connect to the Kubernetes API server directly, since the usual Service-based endpoint resolution provided by kube-proxy is not available.

Cilium can now be installed, using the values above:

cilium install -f values.yaml --version=1.20.0-pre.1

During installation, you may see several warnings and error messages. These usually occur while required resources and services are still starting up and not fully ready yet. This is expected during the bootstrap phase.

You can monitor the installation progress with:

cilium status --wait

The command will exit automatically as soon as Cilium is fully operational.
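
Once Cilium is ready, it is worth confirming that kube-proxy replacement is actually active. Assuming the agent container ships the cilium-dbg binary (as recent versions do), you can run:

# Should report KubeProxyReplacement: True
kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement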

In case you made a mistake in the values.yaml, you can run:

cilium upgrade -f values.yaml --version=1.20.0-pre.1
kubectl rollout restart -n kube-system ds cilium cilium-envoy
kubectl rollout restart -n kube-system deploy cilium-operator

L2 announcements and LoadBalancer IP pools

To make L2 announcements useful, LoadBalancer IP pools are used to assign IPs to services automatically. If your environment supports BGP, you can also use these IP pools with the Cilium BGP control plane instead of L2 announcements.

The simplest form of a CiliumLoadBalancerIPPool resource looks like this:

apiVersion: "cilium.io/v2"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "lb-ip-pool"
spec:
  blocks:
  - start: "10.0.3.5"
    stop: "10.0.3.15"

You can also define CIDR blocks and restrict the IP pool to services with a specific label. This is especially helpful if not all nodes share the same L2 domain.

apiVersion: "cilium.io/v2"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "lb-ip-pool"
spec:
  blocks:
  - cidr: "10.0.10.0/24"
  - cidr: "2004::0/112"
  - start: "10.0.3.5"
    stop: "10.0.3.15"
  serviceSelector:
    matchLabels:
      ippool: l2announced
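
After applying a pool, you can check whether Cilium accepted it; a pool that conflicts with another (for example, overlapping blocks) will not hand out any IPs:

# Shows whether the pool is disabled or conflicting and how many IPs are still available
kubectl get ciliumloadbalancerippool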

Furthermore, you need to define an L2 Announcement Policy. While it offers several configuration options, it primarily determines on which network interfaces L2 announcements are performed. This is a crucial part of the setup, as the selected interface effectively defines the L2 domain. You also need to specify for which type of service the announcements should be performed (externalIP and/or loadBalancer).

---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: cilium-l2-announcement-policy
spec:
  interfaces:
  - ethX
  externalIPs: true
  loadBalancerIPs: true

L2 Announcement Policies also allow you to specify the services for which announcements should take place and on which nodes they are permitted.

apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2announced-services
spec:
  serviceSelector:
    matchLabels:
      ippool: l2announced
  nodeSelector:
    matchExpressions:
      - key: datacenter
        operator: In
        values:
        - dc1
        - dc2
  interfaces:
  - ethX
  externalIPs: true
  loadBalancerIPs: true

A single service can only be announced within one L2 domain. It cannot span multiple domains. By grouping nodes and services, you can operate multiple independent L2 domains within the same cluster, each providing its own failover IP range.

Gateway API

With the above steps, we have already prepared everything needed to create a Gateway and a Route. If you haven't worked with the Gateway API before, you may want to check the official documentation for a brief introduction. In general, the Gateway API is an attempt to replace the current Ingress resources. Many Ingress features were implemented via annotations, which made Ingress highly dependent on the underlying implementation (e.g., Traefik or Nginx Ingress). The Gateway API introduces a role-based setup (e.g., infrastructure providers, cluster operators, application developers). It implements many features that were formerly only available through annotations and provides better portability by reducing the number of custom annotations.

By default, when Gateway API support is enabled, Cilium provides a GatewayClass called cilium.
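
You can verify that the GatewayClass exists and has been accepted:

# The cilium GatewayClass should report ACCEPTED=True
kubectl get gatewayclass cilium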

We can use this GatewayClass to define a simple Gateway:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: central
  namespace: kube-system
spec:
  infrastructure:
    labels:
      ippool: l2announced
  gatewayClassName: cilium
  listeners:
  - allowedRoutes:
      namespaces:
        from: All
    name: http-listener
    port: 80
    protocol: HTTP

The above example creates a Gateway resource in the kube-system namespace, which listens on port 80 for HTTP traffic. It allows HTTPRoute resources to be implemented in all namespaces. In a larger setup, you will most definitely want to restrict the namespaces per Gateway.

We can apply this resource, and it should immediately receive an IP address from the load balancing pool. To verify this, you can run:

kubectl get -n kube-system gateway central
NAME      CLASS    ADDRESS    PROGRAMMED   AGE
central   cilium   10.0.3.5   True         10s
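
Behind the scenes, Cilium creates a LoadBalancer Service for the Gateway; in my setup it showed up as cilium-gateway-central in the kube-system namespace, and the labels from spec.infrastructure.labels were applied to it. This is what makes the generated Service match the serviceSelector of the IP pool and of the L2 announcement policy:

# The generated Service should carry the ippool=l2announced label and the external IP
kubectl get svc -n kube-system cilium-gateway-central --show-labels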

If you followed the tutorial and used label selectors for your nodes and services, you must apply those labels. The service label for the Gateway has already been set via spec.infrastructure.labels in the example above. The nodes are labeled via:

kubectl label node node01 datacenter=dc1

We now need an HTTPRoute to use this Gateway. For testing purposes, we will set up a minimal service using Nginx.

kubectl create deployment myapp --image=nginx:latest
kubectl expose deployment myapp --name myservice --port 80

This service can then be referenced in an HTTPRoute resource:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myroute
  namespace: default
spec:
  hostnames:
  - myservice.findichgut.net
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: central
    namespace: kube-system
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: myservice
      port: 80
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
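
Before testing from a client, it can help to check that the route was accepted by the Gateway; the status conditions show whether the parentRef and the backend were resolved:

# Look for Accepted=True and ResolvedRefs=True under status.parents
kubectl get httproute -n default myroute -o yaml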

After the route has been created, we need to make sure that the test client can reach the Gateway IP. For this initial test, you can also run the request from any cluster node that is in the same L2 domain as the announcements.

curl -v -H "Host: myservice.findichgut.net" http://[YOUR GATEWAY IP]

Verifying the setup

The verification described here is mainly intended for non‑production environments. Network sniffing may expose traffic from other participants and should only be performed in controlled or authorized environments.

To verify the setup, the following behavior should be observed:

  • Monitor ARP/NDP requests for the LoadBalancer IP
  • Identify which node responds to these requests
  • Simulate a node failure
  • Verify that another node takes over ARP/NDP responses

For the steps below, we need a Linux system within the same L2 domain as the interface used for L2 announcements. The tests below cannot be performed on the control plane node.

Test setup

Prepare test services

You can use the service from the example above. You might want to scale it:

kubectl scale deployment myapp --replicas 2

If you would like to see that the load is distributed between the pods within the cluster, you can modify the index.html delivered by Nginx. For example, you can add the pod name to the output:

kubectl get pods -l app=myapp -o name | xargs -I{} kubectl exec {} -- /usr/bin/bash -c "/usr/bin/echo {} > /usr/share/nginx/html/index.html"

# Verify
curl -H "Host: myservice.findichgut.net" http://[YOUR GATEWAY IP]
pod/myapp-cc4ddb978-rqbrs

Monitor ARP/NDP requests for the LoadBalancer IP

ARP entries are cached locally by the operating system. Depending on the OS and kernel parameters, ARP cache entries may remain valid for varying periods of time. This can delay the observable failover behavior on client systems even though the L2 announcement has already switched to another node.

On most Linux systems, you can observe and flush the ARP cache with the ip neigh command, as shown below.
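
For example (the interface placeholder stands for the NIC attached to the L2 domain):

# Show the current neighbor (ARP) cache
ip neigh
# Flush cached entries on the interface facing the L2 domain
sudo ip neigh flush dev [NIC WITHIN THE L2 DOMAIN]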

For testing purposes, I have set up an external system outside the cluster, which shares the same virtual switch and has an IP assigned from the same subnet as the LoadBalancer IP pool.

TCPdump

# Terminal 1:
sudo tcpdump -i [NIC WITHIN THE L2 DOMAIN] -n arp host [YOUR GATEWAY IP]
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp0s9, link-type EN10MB (Ethernet), snapshot length 262144 bytes

# Terminal 2:
# Flush ARP cache
sudo ip neigh flush dev [NIC WITHIN THE L2 DOMAIN]
ping -c 1 [YOUR GATEWAY IP]

# Terminal 1 (Output):
09:46:52.354321 ARP, Request who-has 10.0.10.5 tell 10.0.10.200, length 28
09:46:52.355711 ARP, Reply 10.0.10.5 is-at 08:00:27:77:f7:f0, length 28
09:47:04.012361 ARP, Request who-has 10.0.10.5 tell 10.0.10.200, length 28
09:47:04.013646 ARP, Reply 10.0.10.5 is-at 08:00:27:77:f7:f0, length 28

In the above example output, my gateway IP is 10.0.10.5, and MAC 08:00:27:77:f7:f0 is answering ARP requests for this IP. You can now look through the systems and check which system uses this MAC address. If you have an additional IP configured on the interface that is used for L2 announcements on the nodes, you can also run:

# Terminal 1:
sudo tcpdump -l -i [NIC WITHIN THE L2 DOMAIN] -n arp | grep --line-buffered '[MAC ADDRESS YOU IDENTIFIED EARLIER]'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode

# Terminal 2:
ping -c 1 [FIRST NODE WHICH SHOULD RESPOND]
ping -c 1 [SECOND NODE WHICH SHOULD RESPOND]
...

# Terminal 1 (Output):
listening on enp0s9, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:03:27.682715 ARP, Reply 10.0.3.6 is-at 08:00:27:77:f7:f0, length 28

This shows that the same MAC is listening for 10.0.3.6, which in my case is node01.

ARPping

You can also use arping for this purpose. On Ubuntu/Debian the required package is called arping.

sudo arping -I [NIC WITHIN THE L2 DOMAIN] [YOUR GATEWAY IP]
ARPING 10.0.10.5
58 bytes from 08:00:27:34:31:b5 (10.0.10.5): index=0 time=638.880 usec
58 bytes from 08:00:27:34:31:b5 (10.0.10.5): index=1 time=10.925 usec

sudo arping -I [NIC WITHIN THE L2 DOMAIN] [IP OF YOUR NODES]

IP command

Alternatively, you can also check the system’s ARP table:

ip neigh show [YOUR GATEWAY IP]
10.0.10.5 dev enp0s9 lladdr 08:00:27:34:31:b5 STALE
ip neigh show | grep '[MAC ADDRESS YOU IDENTIFIED EARLIER]'

Simulate a node failure

Depending on your environment, node failure can be simulated in different ways (a way to observe the resulting failover from the Kubernetes side is shown after this list):

  • Power off the machine (most realistic failover test)
  • Disconnect the network interface or unplug the cable
  • Disable networking on the node
  • Stop kubelet and Cilium processes
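
Whichever method you choose, you can also watch the failover from the Kubernetes side: the holder of the announcement lease should switch to another node. The lease name prefix below matches what I saw on my cluster; check the kube-system namespace for the actual names.

# The HOLDER column should switch to another node shortly after the failure
kubectl get leases -n kube-system -w | grep --line-buffered cilium-l2announce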

Verify that another node takes over ARP/NDP responses

TCPdump

# Terminal 1:
sudo tcpdump -i enp0s9 -n arp host 10.0.10.5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp0s9, link-type EN10MB (Ethernet), snapshot length 262144 bytes

# Terminal 2:
ping -c 1 [YOUR GATEWAY IP]

# Terminal 1 (Output):
10:09:36.643221 ARP, Request who-has 10.0.10.5 tell 10.0.10.200, length 28
10:09:37.666549 ARP, Request who-has 10.0.10.5 tell 10.0.10.200, length 28
10:09:38.492097 ARP, Request who-has 10.0.10.5 (ff:ff:ff:ff:ff:ff) tell 10.0.10.5, length 46
10:09:43.746907 ARP, Request who-has 10.0.10.5 tell 10.0.10.200, length 28
10:09:43.751040 ARP, Reply 10.0.10.5 is-at 08:00:27:34:31:b5, length 28

As we can see, the MAC address has changed. After a few seconds, we can also run curl on our setup again. The failover was successful.

Further readings


[1] This is not a strict definition of a Layer 2 domain. The earlier statement that a Layer 2 domain typically corresponds to a single subnet and VLAN is only a practical rule of thumb. For example, devices connected to the same switch within the same VLAN are always part of the same L2 domain, regardless of their IP subnet configuration.