Harnessing the Power of Cilium: A Guide to BGP Integration with Gateway API on IPv4

Allan John
13 min read · Sep 14, 2023



Introduction

This document describes setting up a simple K3d cluster with Cilium as the CNI and the Cilium Gateway API in place of an Ingress controller. It also shows how to use the Cilium BGP Control Plane feature to announce LoadBalancer IPs. Before diving in, here is a short introduction to BGP.

Border Gateway Protocol

Border Gateway Protocol (BGP) is the primary protocol used to exchange routing information between different autonomous systems on the internet. An autonomous system (AS) is a collection of IP networks and routers under the control of a single organisation that presents a common routing policy to the internet. Each router, or peer, in the network announces its route information, which the peers connected on the other end receive and pass on.

Before concluding that BGP is all upside, note its drawbacks. BGP trusts network operators to configure it correctly and announce accurate routes. It has no built-in security, so misconfigurations or malicious announcements can cause real problems.

Enough with the theory, let's get into the cool stuff.

Setup

The setup is going to be:

  • a K3d cluster with no CNI and with kube-proxy disabled.
  • Cilium installed as the CNI, also taking over the routing normally handled by kube-proxy. The BGP control plane is enabled, and a set of IPs is configured for LoadBalancers and advertised over BGP.
  • the Gateway API deployed instead of an Ingress controller, with a single Gateway to test.
  • FRR deployed on the host machine to receive routes to these LoadBalancer IPs. Our setup looks like this:
Architecture for K3D with Cilium

So here we configure Cilium to act as a BGP peer and advertise any new LoadBalancer IPs towards the docker bridge IP (172.50.0.1/32), while FRR is configured to listen for the K3d node IPs, with the docker bridge IP (172.50.0.1) as its router ID. Once the session between FRR and the Cilium BGP control plane is established, FRR publishes these routes into the host's routing table, so the LoadBalancer IPs become reachable from the host system. We then use HAProxy to make the LoadBalancer IPs reachable from the internet via the server's public IP, because these IPs are not publicly routable.

Pardon the convoluted setup; it is only meant for a local environment. When running in the cloud, the IPs come from a publicly routable network, so once they are advertised they are reachable from the internet, and ExternalDNS can be used to sync the hostnames with DNS zones.

Prerequisites

Tools required before starting

  • helm
  • k3d
  • docker
  • cilium-cli (version >= 1.14.1)
  • frr
  • haproxy

Some things to check or enable (which I could not find in the docs, but only in GitHub issues):

Enable IPv4 forwarding:

echo "net.ipv4.ip_forward=1" | sudo tee /etc/sysctl.d/01-sysctl.conf > /dev/null
sudo sysctl --system

Enable the socket match kernel module

The module is part of the netfilter kernel modules; it matches IPv4 or IPv6 packets to their associated sockets.

lsmod | grep xt_socket

If the output is empty, load the module:

sudo modprobe xt_socket -v
insmod /lib/modules/6.1.46-0-lts/kernel/net/ipv6/netfilter/nf_socket_ipv6.ko.gz
insmod /lib/modules/6.1.46-0-lts/kernel/net/ipv4/netfilter/nf_socket_ipv4.ko.gz
insmod /lib/modules/6.1.46-0-lts/kernel/net/netfilter/xt_socket.ko.gz

Check again that the module is loaded:

lsmod | grep xt_socket
xt_socket 16384 0
nf_socket_ipv4 16384 1 xt_socket
nf_socket_ipv6 20480 1 xt_socket
nf_defrag_ipv6 24576 3 nf_conntrack,xt_socket,xt_TPROXY
nf_defrag_ipv4 16384 3 nf_conntrack,xt_socket,xt_TPROXY
x_tables 61440 15 ip6table_filter,xt_conntrack,iptable_filter,ip6table_nat,xt_socket,xt_comment,ip6_tables,xt_TPROXY,xt_CT,iptable_raw,ip_tables,iptable_nat,ip6table_mangle,iptable_mangle,xt_mark

Let's start…

All configs and scripts can be found in this GitHub repo.

First, create a separate docker network and call it cilium.

docker network create \
--driver bridge \
--subnet "172.50.0.0/16" \
--gateway "172.50.0.1" \
--ip-range "172.50.0.0/16" \
"cilium"

To create a k3d cluster with Cilium, we first need to run some commands to mount the BPF filesystem and cgroups. It is better to run these commands from a k3d-entrypoint-cilium.sh script than to run them by hand after the k3d containers are up, so we create a script and mount it as an entrypoint.

k3d-entrypoint-cilium.sh
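A minimal sketch of k3d-entrypoint-cilium.sh, assuming k3d runs every /bin/k3d-entrypoint-*.sh script before starting k3s, could look like this:

#!/bin/sh
set -e

# Mount the BPF filesystem so the Cilium agent can load its eBPF programs
mount bpffs -t bpf /sys/fs/bpf
mount --make-shared /sys/fs/bpf

# Mount a dedicated cgroup v2 hierarchy for Cilium's socket-based load balancing
mkdir -p /run/cilium/cgroupv2
mount -t cgroup2 none /run/cilium/cgroupv2
mount --make-shared /run/cilium/cgroupv2/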

Create a k3d config file

k3d-config.yaml

Notable configurations:

  • kubeAPI.hostIP: 127.172.50.1 : set this way so that we can reach the k3d API server via the docker interface.
  • nodeLabels : added so that later on we can target the BGP peering policy at specific nodes, applying the policy to certain nodes only.
  • Most of the other settings are routine: disabling the default load balancer, kube-proxy and flannel, and setting the cluster and service CIDRs. A trimmed-down sketch of the config file follows below.
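A sketch of k3d-config.yaml along these lines, assuming cluster CIDR 10.21.0.0/16 and service CIDR 10.201.0.0/16 (matching the outputs later in this post) and an illustrative bgp-policy=a node label, could look like this:

apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: cilium-cluster
servers: 1
agents: 2
image: rancher/k3s:v1.25.7-k3s1
network: cilium
kubeAPI:
  hostIP: "127.172.50.1"
  hostPort: "6443"
volumes:
  # run the mount script on every node before k3s starts
  - volume: ./k3d-entrypoint-cilium.sh:/bin/k3d-entrypoint-cilium.sh
    nodeFilters:
      - server:*
      - agent:*
options:
  k3d:
    disableLoadbalancer: true
  k3s:
    extraArgs:
      - arg: --flannel-backend=none
        nodeFilters:
          - server:*
      - arg: --disable-network-policy
        nodeFilters:
          - server:*
      - arg: --disable-kube-proxy
        nodeFilters:
          - server:*
      - arg: --disable=traefik
        nodeFilters:
          - server:*
      - arg: --disable=servicelb
        nodeFilters:
          - server:*
      - arg: --cluster-cidr=10.21.0.0/16
        nodeFilters:
          - server:*
      - arg: --service-cidr=10.201.0.0/16
        nodeFilters:
          - server:*
    nodeLabels:
      - label: bgp-policy=a   # illustrative label, referenced by the BGP peering policy later
        nodeFilters:
          - server:*
          - agent:*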

Deploy k3d cluster

 k3d cluster create -c k3d-config.yaml

When the cluster is deployed, there will be only a few pods, all in Pending state. This is normal, as we still need to deploy a CNI.

docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a17e139be15 rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 9 seconds k3d-cilium-cluster-agent-1
073d0f5d9b76 rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 9 seconds k3d-cilium-cluster-agent-0
3740871a7ecb rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 16 seconds 127.172.50.1:6443->6443/tcp k3d-cilium-cluster-server-0

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-597584b69b-7l77z 0/1 Pending 0 1s
kube-system metrics-server-5f9f776df5-sftwl 0/1 Pending 0 1s
kube-system local-path-provisioner-79f67d76f8-bg9b8 0/1 Pending 0 1s

And for Cilium's Gateway API support, a couple of Gateway API CRDs need to be installed first.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_gatewayclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_gateways.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_httproutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_tlsroutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_referencegrants.yaml

Cilium will be installed as a Helm chart. When installing Cilium as the CNI and as a replacement for kube-proxy, we need to set the k8sServiceHost and k8sServicePort values and kubeProxyReplacement: strict, because we need Cilium to handle all the network routing.

cilium-values.yaml
  • ipam.mode: kubernetes : pools IPs from Kubernetes. The other option would be cluster-pool, which allocates IPs from a configured CIDR and is useful for multi-cluster setups. For a single cluster, kubernetes is fine.
  • k8s-require-ipv4-pod-cidr : makes Cilium use the cluster-cidr configured on the cluster. Otherwise, Cilium picks a range inside 10.0.0.0/8 and starts using those IPs. For some reason, those IPs were not announced via BGP when the pod CIDRs were exported (more on that later).
  • tunnel: vxlan : the default mode when tunnelling is used. It is not necessary for a single cluster but is useful for a multi-cluster setup. Tunnelling adds extra encapsulation, which is fine for a small setup; for larger clusters with many applications, the MTU of the interfaces should be raised to jumbo frames (a value of 9000). (If I succeed in changing this, I will update this post or add a new one.)
  • I have disabled the Ingress controller so that I can specifically enable the Gateway API. A sketch of the values file follows below.
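A minimal sketch of cilium-values.yaml with the settings discussed above; the k8sServiceHost value is an assumption (with kube-proxy gone, Cilium needs an API-server address reachable inside the docker network, and the k3d server container name works for that):

kubeProxyReplacement: strict
k8sServiceHost: k3d-cilium-cluster-server-0   # assumption: API server reachable via the server node name
k8sServicePort: 6443

ipam:
  mode: kubernetes
k8s:
  requireIPv4PodCIDR: true   # Helm equivalent of k8s-require-ipv4-pod-cidr
tunnel: vxlan

bgpControlPlane:
  enabled: true

ingressController:
  enabled: false
gatewayAPI:
  enabled: true

hubble:
  relay:
    enabled: true
  ui:
    enabled: true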

Install Cilium Helm chart

helm upgrade --install cilium cilium/cilium --version 1.14.1 \
--namespace=kube-system -f cilium-values.yaml

Wait a while for the Cilium operator to be installed; the Cilium agent pods will then be deployed to each node, and eventually all pods should be in Running state.

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-operator-d5dd47988-58gn7 1/1 Running 0 59s
kube-system cilium-7bm9g 1/1 Running 0 59s
kube-system cilium-lm5zc 1/1 Running 0 59s
kube-system cilium-wh5pf 1/1 Running 0 59s
kube-system local-path-provisioner-79f67d76f8-bg9b8 1/1 Running 0 2m34s
kube-system coredns-597584b69b-7l77z 1/1 Running 0 2m34s
kube-system hubble-relay-6bb96bd796-d596q 1/1 Running 0 59s
kube-system metrics-server-5f9f776df5-sftwl 1/1 Running 0 2m34s
kube-system hubble-ui-869b75b895-s2m8f 2/2 Running 0 59s

Cilium status can also be checked with the cilium-cli tool

cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled

Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
Containers: cilium Running: 3
hubble-relay Running: 1
hubble-ui Running: 1
cilium-operator Running: 1
Cluster Pods: 5/5 managed by Cilium
Helm chart version: 1.14.1
Image versions hubble-ui quay.io/cilium/hubble-ui:v0.12.0@sha256:1c876cfa1d5e35bc91e1025c9314f922041592a88b03313c22c1f97a5d2ba88f: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.12.0@sha256:8a79a1aad4fc9c2aa2b3e4379af0af872a89fcec9d99e117188190671c66fc2e: 1
cilium-operator quay.io/cilium/operator-generic:v1.14.1@sha256:e061de0a930534c7e3f8feda8330976367971238ccafff42659f104effd4b5f7: 1
cilium quay.io/cilium/cilium:v1.14.1@sha256:edc1d05ea1365c4a8f6ac6982247d5c145181704894bb698619c3827b6963a72: 3
hubble-relay quay.io/cilium/hubble-relay:v1.14.1@sha256:db30e85a7abc10589ce2a97d61ee18696a03dc5ea04d44b4d836d88bd75b59d8: 1

Now we need to define a pool of IPs to hand out when Services of type LoadBalancer are created, or when the Gateway API provisions a Gateway.
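A minimal sketch of ippool.yaml, assuming a /24 inside the docker network that matches the 172.50.201.x addresses seen later:

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  cidrs:
    - cidr: "172.50.201.0/24"   # assumed range, kept clear of the 172.50.0.x node IPs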

Install it:

kubectl apply -f ippool.yaml

Previously, IP pool management and the assignment of IPs to LoadBalancer services was handled by MetalLB. Since the BGP control plane feature is now built into Cilium, we no longer need it. With older Cilium versions, pairing Cilium with MetalLB would still be the better option.

Now we need to define a BGP peering policy for Cilium (a minimal manifest is sketched after this list):

  • nodeSelector : applies the peering policy to the nodes that carry the matching label. I have added this label to all nodes for now.
  • localASN : the local ASN of the Cilium BGP router.
  • exportPodCIDR : controls whether the pod CIDRs are advertised. We do not need it right now; it is nice to be able to reach pods directly, but that is not really secure IMHO, so it stays false.
  • neighbors : the neighbours for the BGP peering. We point it at the docker network gateway, as this is the entry point for communication between the docker network and the host machine.
  • peerASN : the ASN of the peer. It could be the same as the localASN, but for testing, and to clearly see that the session gets established, we use a different private ASN.
  • serviceSelector : determines which service IPs are advertised. To advertise all of them, use a matchExpressions entry with the NotIn operator and a dummy value.
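A sketch of bgpp.yaml matching the values discussed above (the bgp-policy=a node label is the illustrative one from the k3d config sketch):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: bgp-peering-policy
spec:
  nodeSelector:
    matchLabels:
      bgp-policy: a                      # assumed label set on the nodes via the k3d config
  virtualRouters:
    - localASN: 64512
      exportPodCIDR: false               # set to true if you also want the pod CIDR routes seen in the outputs below
      neighbors:
        - peerAddress: "172.50.0.1/32"   # docker network gateway, where FRR listens
          peerASN: 64513
      serviceSelector:                   # NotIn with a dummy value advertises all LoadBalancer services
        matchExpressions:
          - key: somekey
            operator: NotIn
            values: ["never-used-value"]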

Apply the Peering policy

kubectl apply -f bgpp.yaml

Check the BGP status with the cilium-cli:

cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
k3d-cilium-cluster-agent-1 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
k3d-cilium-cluster-server-0 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0

As you can see, the Session State is active, which means the BGP speaker in Cilium is up and trying to connect, but no peer has answered yet. To configure the host machine as the other end of the peering, we install and configure FRR:

sudo apt install frr -y

Enable BGP in FRR by editing /etc/frr/daemons and changing the value of bgpd from no to yes.

Edit the file /etc/frr/frr.conf and add a config along the lines sketched after this list:

  • We set the local ASN to 64513, because that is the number we used as the peerASN in Cilium.
  • We set the BGP router-id to the gateway IP of the docker network created for the cluster.
  • We define each node in the cluster as a neighbour, so that FRR receives their announcements, and set its remote ASN.
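A sketch of /etc/frr/frr.conf using the ASNs from above and the node IPs 172.50.0.2-4 seen in the outputs below; on recent FRR versions, no bgp ebgp-requires-policy is needed so that eBGP routes are exchanged without explicit route-maps:

frr defaults traditional
log syslog informational
!
router bgp 64513
 bgp router-id 172.50.0.1
 no bgp ebgp-requires-policy
 neighbor 172.50.0.2 remote-as 64512
 neighbor 172.50.0.3 remote-as 64512
 neighbor 172.50.0.4 remote-as 64512
!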

Now start FRR:

sudo systemctl start frr

After a couple of seconds, you can check the peering status from either Cilium or FRR.

cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 172.50.0.1 established 10m49s ipv4/unicast 3 1
ipv6/unicast 0 0
k3d-cilium-cluster-agent-1 64512 64513 172.50.0.1 established 10m47s ipv4/unicast 3 1
ipv6/unicast 0 0
k3d-cilium-cluster-server-0 64512 64513 172.50.0.1 established 10m51s ipv4/unicast 3 1
ipv6/unicast 0 0

sudo vtysh -c 'show bgp summary'

IPv4 Unicast Summary (VRF default):
BGP router identifier 172.50.0.1, local AS number 64513 vrf-id 0
BGP table version 18
RIB entries 0, using 920 bytes of memory
Peers 3, using 4338 KiB of memory

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
172.50.0.2 4 64512 1619 1624 0 0 0 00:10:54 1 3 N/A
172.50.0.3 4 64512 1619 1620 0 0 0 00:10:52 1 3 N/A
172.50.0.4 4 64512 1619 1620 0 0 0 00:10:50 1 3 N/A


Total number of neighbors 3

The State/PfxRcd column in FRR now shows a prefix count, meaning the connection is established, with the session duration in the Up/Down field. On the cilium-cli side, the Session State has changed to established.

If you check the host route table, you will see the existing routes (plus any pod CIDR routes learned over BGP), but nothing for the LoadBalancer IPs yet.

route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 xxx.xxx.xxx.1 0.0.0.0 UG 0 0 0 eth0
10.21.0.0 172.50.0.2 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.1.0 172.50.0.3 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.2.0 172.50.0.4 255.255.255.0 UG 20 0 0 br-7821810f4fc8
xxx.xxx.xxx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.50.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7821810f4fc8

Let's create a Gateway and deploy an HTTPRoute (a combined manifest is sketched after this list):

  • gatewayClassName: cilium : we use the default GatewayClass created by the Helm chart.
  • listeners : the listener configuration for the Gateway.
  • allowedRoutes : access control for the Gateway. For now I leave it open; alternatively, a Selector with a label could be used so that only namespaces carrying that label may attach HTTPRoutes to this Gateway.
  • parentRefs : maps the HTTPRoute to the proper Gateway.
  • rules : the routing rules of the HTTPRoute.
  • backendRefs : connects the HTTPRoute to the Hubble UI service.
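A sketch of hubble.yaml matching the objects shown below (namespace, names and the hubble-ui backend come from the outputs; the listener details are assumptions):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: shared-gw
  namespace: kube-system
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All            # open to all namespaces for now
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: hubble
  namespace: kube-system
spec:
  parentRefs:
    - name: shared-gw
      namespace: kube-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: hubble-ui
          port: 80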

Now deploy the config

kubectl apply -f hubble.yaml

Let's now check the Gateway, HTTPRoute and Services:

kubectl get httproute -n kube-system
NAME HOSTNAMES AGE
hubble 1m

kubectl get gateway -n kube-system
NAME CLASS ADDRESS PROGRAMMED AGE
shared-gw cilium 172.50.201.37 True 1m

kubectl get services -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.201.0.10 <none> 53/UDP,53/TCP,9153/TCP 29m
metrics-server ClusterIP 10.201.130.169 <none> 443/TCP 29m
hubble-peer ClusterIP 10.201.167.135 <none> 443/TCP 28m
hubble-relay ClusterIP 10.201.219.116 <none> 80/TCP 28m
hubble-ui ClusterIP 10.201.196.229 <none> 80/TCP 28m
cilium-gateway-shared-gw LoadBalancer 10.201.222.86 172.50.201.37 80:31031/TCP 1m20s

Nice, Cilium has created a LoadBalancer Service for our Gateway object.

Check the host route table again

route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 xxx.xxx.xxx.1 0.0.0.0 UG 0 0 0 eth0
10.21.0.0 172.50.0.2 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.1.0 172.50.0.3 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.2.0 172.50.0.4 255.255.255.0 UG 20 0 0 br-7821810f4fc8
xxx.xxx.xxx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.50.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7821810f4fc8
172.50.201.37 172.50.0.2 255.255.255.255 UGH 20 0 0 br-7821810f4fc8

The route advertised via BGP has now been added to the host machine's route table.

With FRR, you can also inspect the routes it advertises towards a given neighbor:

sudo vtysh -c 'show ip bgp neighbors 172.50.0.3 advertised-routes'
BGP table version is 19, local router ID is 172.50.0.1, vrf id 0
Default local pref 100, local AS 64513
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*> 10.21.0.0/24 0.0.0.0 0 64512 i
*> 10.21.1.0/24 0.0.0.0 0 64512 i
*> 10.21.2.0/24 0.0.0.0 0 64512 i
*> 172.50.201.37/32 0.0.0.0 0 64512 i

Total number of prefixes 4

Now let's curl the LoadBalancer IP to see if we can reach the page:

curl http://172.50.201.37/
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.104f057a7d45238d9d45.js"></script><link href="bundle.main.3818224e482785607640.css" rel="stylesheet"></head><body><div id="app"></div></body></html>

And there it is !!!

Local DNS testing of route

Please note that the domain name I use here is a placeholder; replace it with your own. The domain name at the registrar and on the HTTPRoute should match.

Since this is a local setup, we cannot just add hostnames to the HTTPRoute and expect DNS to resolve them. We can make it work by adding a hostname to the HTTPRoute and pointing that hostname to the LoadBalancer IP in /etc/hosts on the host machine. Let's try that now.

It is the same HTTPRoute, with one small difference: we add a hostnames field. Let's apply it and check the status of the HTTPRoute.
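The only change needed in hubble-with-hostnames.yaml is the hostnames field on the HTTPRoute spec; everything else stays as in the earlier sketch:

spec:
  hostnames:                   # new: restrict the route to this hostname
    - "hubble.example.com"
  parentRefs:
    - name: shared-gw
      namespace: kube-system
  # rules: unchanged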

kubectl apply -f hubble-with-hostnames.yaml
kubectl get httproute -n kube-system
NAME HOSTNAMES AGE
hubble ["hubble.example.com"] 22m

As you can see, the hostname has been added. It will not work yet, though, because the domain name is not resolvable. Let's add it to the hosts file and check:

echo "172.50.201.37 hubble.example.com" | sudo tee -a /etc/hosts>/dev/null

and now with curl

curl hubble.example.com
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.104f057a7d45238d9d45.js"></script><link href="bundle.main.3818224e482785607640.css" rel="stylesheet"></head><body><div id="app"></div></body></html>

Nice!!!

Before continuing, don’t forget to remove the entries from /etc/hosts

HAProxy Setup

Let's try to make it resolvable through real DNS. One way is to point the domain's A record at the public IP address of the server and let HAProxy route the requests to the advertised LoadBalancer IP as its backend.

As explained before, when running in the cloud we can use ExternalDNS to keep the records updated. In local setups, HAProxy is king :D

So lets test with HAProxy

sudo apt install haproxy -y

Now update the /etc/haproxy/haproxy.cfg file with a frontend and backend for our domain.

The important part is the last section, where the frontend and backend are defined; the rest of the config is the distribution default.

We match on the requested domain name and forward the traffic to a backend configured with the LoadBalancer IP.
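A sketch of that frontend/backend section, using the hostname and LoadBalancer IP from above:

frontend http-in
    bind *:80
    # match on the Host header and send the traffic to the Gateway's LoadBalancer IP
    acl host_hubble hdr(host) -i hubble.example.com
    use_backend hubble_ui if host_hubble

backend hubble_ui
    server cilium-gateway 172.50.201.37:80 check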

Now is the time to configure DNS: point the domain at the public IP of the server. In these examples I have replaced my real domain with example.com.

Now start HAProxy

sudo systemctl start haproxy

Open a browser and go to the domain “hubble.example.com”.

Again, the domain name used here is a placeholder; replace it with your own, and make sure the registrar and the HTTPRoute use the same name.

Conclusion

Cilium is an excellent CNI and has begun to embrace the Gateway API, which offers far more functionality than an Ingress controller. Here I showed how Cilium can be used in a small, simple cluster setup. It can also be used at large scale, but the configuration and tooling will differ.

I hope you enjoyed this blog

Update: see the follow-up blog on a pure IPv6 Kubernetes deployment with k3d.
