Harnessing the Power of Cilium: A Guide to BGP Integration with Gateway API on IPv4

Allan John
13 min read · Sep 14, 2023



Introduction

This document describes setting up a simple K3d cluster with Cilium as the CNI and the Cilium Gateway API in place of an Ingress controller. It also shows how to use the Cilium BGP Control Plane feature to announce LoadBalancer IPs. Before diving in, here is a short introduction to BGP.

Border Gateway Protocol

Border Gateway Protocol (BGP) is the primary protocol used to exchange routing information between different autonomous systems on the internet. An autonomous system (AS) is a collection of IP networks and routers under the control of a single organisation that presents a common routing policy to the internet. Each router, or peer, in the network announces its route information, which the peers connected on the other end receive and pass on.

Before concluding that BGP is all upside, note its drawbacks. BGP trusts network operators to configure it correctly and announce accurate routes. It has no built-in security, so misconfigurations or malicious announcements can cause real problems.

Enough with the theory, let's get into the cool stuff.

Setup

The setup is going to be:

  • a K3d cluster with no CNI and with kube-proxy disabled.
  • Cilium installed as the CNI, also taking over the routing normally handled by kube-proxy. The BGP control plane is enabled, and a set of IPs is configured for LoadBalancers and advertised over BGP.
  • the Gateway API deployed instead of an Ingress controller, with a single Gateway to test.
  • FRR deployed on the host machine to receive routes to these LoadBalancer IPs. Our setup looks like this:
Architecture for K3D with Cilium

So here we configure Cilium to act as a BGP peer and advertise any new LoadBalancer IPs towards the docker bridge IP (172.50.0.1/32), while FRR is configured to listen for the K3d node IPs, with the docker bridge IP (172.50.0.1) as its router ID. Once the session between FRR and the Cilium BGP control plane is established, FRR publishes these routes into the host's routing table, so the LoadBalancer IPs become reachable from the host system. We then use HAProxy to make the LoadBalancer IPs reachable from the internet via the server's public IP, because these IPs are not publicly routable.

Pardon the convoluted setup; it is only meant for a local environment. When running in the cloud, the IPs come from a publicly routable network, so once they are advertised they are reachable from the internet, and ExternalDNS can be used to sync the hostnames with DNS zones.

Prerequisites

Tools required before starting

  • helm
  • k3d
  • docker
  • cilium-cli (version >= 1.14.1)
  • frr
  • haproxy

Some things to check or enable (which I could not find in the docs, but only in GitHub issues):

Enable IPv4 forwarding:

echo "net.ipv4.ip_forward=1" | sudo tee /etc/sysctl.d/01-sysctl.conf > /dev/null
sudo sysctl --system

Enable the socket match kernel module

The module is part of the netfilter kernel modules; it matches IPv4 or IPv6 packets to their associated sockets.

lsmod | grep xt_socket

If the output is empty, load the module:

sudo modprobe xt_socket -v
insmod /lib/modules/6.1.46-0-lts/kernel/net/ipv6/netfilter/nf_socket_ipv6.ko.gz
insmod /lib/modules/6.1.46-0-lts/kernel/net/ipv4/netfilter/nf_socket_ipv4.ko.gz
insmod /lib/modules/6.1.46-0-lts/kernel/net/netfilter/xt_socket.ko.gz

Check again that the module is loaded:

lsmod | grep xt_socket
xt_socket 16384 0
nf_socket_ipv4 16384 1 xt_socket
nf_socket_ipv6 20480 1 xt_socket
nf_defrag_ipv6 24576 3 nf_conntrack,xt_socket,xt_TPROXY
nf_defrag_ipv4 16384 3 nf_conntrack,xt_socket,xt_TPROXY
x_tables 61440 15 ip6table_filter,xt_conntrack,iptable_filter,ip6table_nat,xt_socket,xt_comment,ip6_tables,xt_TPROXY,xt_CT,iptable_raw,ip_tables,iptable_nat,ip6table_mangle,iptable_mangle,xt_mark

Let's start…

All configs and scripts can be found in this GitHub repo.

First, create a separate docker network and call it cilium.

docker network create \
--driver bridge \
--subnet "172.50.0.0/16" \
--gateway "172.50.0.1" \
--ip-range "172.50.0.0/16" \
"cilium"

To create a k3d cluster with Cilium, we first need to run some commands to mount the BPF filesystem and cgroups. It is better to run these commands from a k3d-entrypoint-cilium.sh script than to run them by hand after the k3d containers are up, so we create a script and mount it as an entrypoint.

k3d-entrypoint-cilium.sh
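A minimal sketch of k3d-entrypoint-cilium.sh, assuming k3d runs every /bin/k3d-entrypoint-*.sh script before starting k3s, could look like this:

#!/bin/sh
set -e

# Mount the BPF filesystem so the Cilium agent can load its eBPF programs
mount bpffs -t bpf /sys/fs/bpf
mount --make-shared /sys/fs/bpf

# Mount a dedicated cgroup v2 hierarchy for Cilium's socket-based load balancing
mkdir -p /run/cilium/cgroupv2
mount -t cgroup2 none /run/cilium/cgroupv2
mount --make-shared /run/cilium/cgroupv2/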

Create a k3d config file

k3d-config.yaml

Notable configurations:

  • kubeAPI.hostIP: 127.172.50.1 : set this way so that we can reach the k3d API server via the docker interface.
  • nodeLabels : added so that later on we can target the BGP peering policy at specific nodes, applying the policy to certain nodes only.
  • Most of the other settings are routine: disabling the default load balancer, kube-proxy and flannel, and setting the cluster and service CIDRs. A trimmed-down sketch of the config file follows below.
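A sketch of k3d-config.yaml along these lines, assuming cluster CIDR 10.21.0.0/16 and service CIDR 10.201.0.0/16 (matching the outputs later in this post) and an illustrative bgp-policy=a node label, could look like this:

apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: cilium-cluster
servers: 1
agents: 2
image: rancher/k3s:v1.25.7-k3s1
network: cilium
kubeAPI:
  hostIP: "127.172.50.1"
  hostPort: "6443"
volumes:
  # run the mount script on every node before k3s starts
  - volume: ./k3d-entrypoint-cilium.sh:/bin/k3d-entrypoint-cilium.sh
    nodeFilters:
      - server:*
      - agent:*
options:
  k3d:
    disableLoadbalancer: true
  k3s:
    extraArgs:
      - arg: --flannel-backend=none
        nodeFilters:
          - server:*
      - arg: --disable-network-policy
        nodeFilters:
          - server:*
      - arg: --disable-kube-proxy
        nodeFilters:
          - server:*
      - arg: --disable=traefik
        nodeFilters:
          - server:*
      - arg: --disable=servicelb
        nodeFilters:
          - server:*
      - arg: --cluster-cidr=10.21.0.0/16
        nodeFilters:
          - server:*
      - arg: --service-cidr=10.201.0.0/16
        nodeFilters:
          - server:*
    nodeLabels:
      - label: bgp-policy=a   # illustrative label, referenced by the BGP peering policy later
        nodeFilters:
          - server:*
          - agent:*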

Deploy k3d cluster

 k3d cluster create -c k3d-config.yaml

When the cluster is deployed, there will be only a few pods, all in Pending state. This is normal, as we still need to deploy a CNI.

docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a17e139be15 rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 9 seconds k3d-cilium-cluster-agent-1
073d0f5d9b76 rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 9 seconds k3d-cilium-cluster-agent-0
3740871a7ecb rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 16 seconds 127.172.50.1:6443->6443/tcp k3d-cilium-cluster-server-0

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-597584b69b-7l77z 0/1 Pending 0 1s
kube-system metrics-server-5f9f776df5-sftwl 0/1 Pending 0 1s
kube-system local-path-provisioner-79f67d76f8-bg9b8 0/1 Pending 0 1s

And for Cilium's Gateway API support, a couple of Gateway API CRDs need to be installed first.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_gatewayclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_gateways.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_httproutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_tlsroutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_referencegrants.yaml

Cilium will be installed as a Helm chart. When installing Cilium as the CNI and as a replacement for kube-proxy, we need to set the k8sServiceHost and k8sServicePort values and kubeProxyReplacement: strict, because we need Cilium to handle all the network routing.

cilium-values.yaml
  • ipam.mode: kubernetes : pools IPs from Kubernetes. The other option would be cluster-pool, which allocates IPs from a configured CIDR and is useful for multi-cluster setups. For a single cluster, kubernetes is fine.
  • k8s-require-ipv4-pod-cidr : makes Cilium use the cluster-cidr configured on the cluster. Otherwise, Cilium picks a range inside 10.0.0.0/8 and starts using those IPs. For some reason, those IPs were not announced via BGP when the pod CIDRs were exported (more on that later).
  • tunnel: vxlan : the default mode when tunnelling is used. It is not necessary for a single cluster but is useful for a multi-cluster setup. Tunnelling adds extra encapsulation, which is fine for a small setup; for larger clusters with many applications, the MTU of the interfaces should be raised to jumbo frames (a value of 9000). (If I succeed in changing this, I will update this post or add a new one.)
  • I have disabled the Ingress controller so that I can specifically enable the Gateway API. A sketch of the values file follows below.
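A minimal sketch of cilium-values.yaml with the settings discussed above; the k8sServiceHost value is an assumption (with kube-proxy gone, Cilium needs an API-server address reachable inside the docker network, and the k3d server container name works for that):

kubeProxyReplacement: strict
k8sServiceHost: k3d-cilium-cluster-server-0   # assumption: API server reachable via the server node name
k8sServicePort: 6443

ipam:
  mode: kubernetes
k8s:
  requireIPv4PodCIDR: true   # Helm equivalent of k8s-require-ipv4-pod-cidr
tunnel: vxlan

bgpControlPlane:
  enabled: true

ingressController:
  enabled: false
gatewayAPI:
  enabled: true

hubble:
  relay:
    enabled: true
  ui:
    enabled: true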

Install Cilium Helm chart

helm upgrade --install cilium cilium/cilium --version 1.14.1 \
--namespace=kube-system -f cilium-values.yaml

Wait a while for the Cilium operator to be installed; the Cilium agent pods will then be deployed to each node, and eventually all pods should be in Running state.

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-operator-d5dd47988-58gn7 1/1 Running 0 59s
kube-system cilium-7bm9g 1/1 Running 0 59s
kube-system cilium-lm5zc 1/1 Running 0 59s
kube-system cilium-wh5pf 1/1 Running 0 59s
kube-system local-path-provisioner-79f67d76f8-bg9b8 1/1 Running 0 2m34s
kube-system coredns-597584b69b-7l77z 1/1 Running 0 2m34s
kube-system hubble-relay-6bb96bd796-d596q 1/1 Running 0 59s
kube-system metrics-server-5f9f776df5-sftwl 1/1 Running 0 2m34s
kube-system hubble-ui-869b75b895-s2m8f 2/2 Running 0 59s

Cilium status can also be checked with the cilium-cli tool

cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled

Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
Containers: cilium Running: 3
hubble-relay Running: 1
hubble-ui Running: 1
cilium-operator Running: 1
Cluster Pods: 5/5 managed by Cilium
Helm chart version: 1.14.1
Image versions hubble-ui quay.io/cilium/hubble-ui:v0.12.0@sha256:1c876cfa1d5e35bc91e1025c9314f922041592a88b03313c22c1f97a5d2ba88f: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.12.0@sha256:8a79a1aad4fc9c2aa2b3e4379af0af872a89fcec9d99e117188190671c66fc2e: 1
cilium-operator quay.io/cilium/operator-generic:v1.14.1@sha256:e061de0a930534c7e3f8feda8330976367971238ccafff42659f104effd4b5f7: 1
cilium quay.io/cilium/cilium:v1.14.1@sha256:edc1d05ea1365c4a8f6ac6982247d5c145181704894bb698619c3827b6963a72: 3
hubble-relay quay.io/cilium/hubble-relay:v1.14.1@sha256:db30e85a7abc10589ce2a97d61ee18696a03dc5ea04d44b4d836d88bd75b59d8: 1

Now we need to define a pool of IPs to hand out when Services of type LoadBalancer are created, or when the Gateway API provisions a Gateway.
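A minimal sketch of ippool.yaml, assuming a /24 inside the docker network that matches the 172.50.201.x addresses seen later:

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  cidrs:
    - cidr: "172.50.201.0/24"   # assumed range, kept clear of the 172.50.0.x node IPs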

Install it:

kubectl apply -f ippool.yaml

Previously, IP pool management and the assignment of IPs to LoadBalancer services was handled by MetalLB. Since the BGP control plane feature is now built into Cilium, we no longer need it. With older Cilium versions, pairing Cilium with MetalLB would still be the better option.

Now we need to define a BGP peering policy for Cilium (a minimal manifest is sketched after this list):

  • nodeSelector : applies the peering policy to the nodes that carry the matching label. I have added this label to all nodes for now.
  • localASN : the local ASN of the Cilium BGP router.
  • exportPodCIDR : controls whether the pod CIDRs are advertised. We do not need it right now; it is nice to be able to reach pods directly, but that is not really secure IMHO, so it stays false.
  • neighbors : the neighbours for the BGP peering. We point it at the docker network gateway, as this is the entry point for communication between the docker network and the host machine.
  • peerASN : the ASN of the peer. It could be the same as the localASN, but for testing, and to clearly see that the session gets established, we use a different private ASN.
  • serviceSelector : determines which service IPs are advertised. To advertise all of them, use a matchExpressions entry with the NotIn operator and a dummy value.
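A sketch of bgpp.yaml matching the values discussed above (the bgp-policy=a node label is the illustrative one from the k3d config sketch):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: bgp-peering-policy
spec:
  nodeSelector:
    matchLabels:
      bgp-policy: a                      # assumed label set on the nodes via the k3d config
  virtualRouters:
    - localASN: 64512
      exportPodCIDR: false               # set to true if you also want the pod CIDR routes seen in the outputs below
      neighbors:
        - peerAddress: "172.50.0.1/32"   # docker network gateway, where FRR listens
          peerASN: 64513
      serviceSelector:                   # NotIn with a dummy value advertises all LoadBalancer services
        matchExpressions:
          - key: somekey
            operator: NotIn
            values: ["never-used-value"]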

Apply the Peering policy

kubectl apply -f bgpp.yaml

Check the BGP status with the cilium-cli:

cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
k3d-cilium-cluster-agent-1 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
k3d-cilium-cluster-server-0 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0

As you can see, the Session State is active, which means the BGP speaker in Cilium is up and trying to connect, but no peer has answered yet. To configure the host machine as the other end of the peering, we install and configure FRR:

sudo apt install frr -y

Enable BGP in FRR by editing /etc/frr/daemons and changing the value of bgpd from no to yes.

Edit the file /etc/frr/frr.conf and add a config along the lines sketched after this list:

  • We set the local ASN to 64513, because that is the number we used as the peerASN in Cilium.
  • We set the BGP router-id to the gateway IP of the docker network created for the cluster.
  • We define each node in the cluster as a neighbour, so that FRR receives their announcements, and set its remote ASN.
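A sketch of /etc/frr/frr.conf using the ASNs from above and the node IPs 172.50.0.2-4 seen in the outputs below; on recent FRR versions, no bgp ebgp-requires-policy is needed so that eBGP routes are exchanged without explicit route-maps:

frr defaults traditional
log syslog informational
!
router bgp 64513
 bgp router-id 172.50.0.1
 no bgp ebgp-requires-policy
 neighbor 172.50.0.2 remote-as 64512
 neighbor 172.50.0.3 remote-as 64512
 neighbor 172.50.0.4 remote-as 64512
!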

Now start FRR:

sudo systemctl start frr

After a couple of seconds, you can check the peering status from either Cilium or FRR.

cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 172.50.0.1 established 10m49s ipv4/unicast 3 1
ipv6/unicast 0 0
k3d-cilium-cluster-agent-1 64512 64513 172.50.0.1 established 10m47s ipv4/unicast 3 1
ipv6/unicast 0 0
k3d-cilium-cluster-server-0 64512 64513 172.50.0.1 established 10m51s ipv4/unicast 3 1
ipv6/unicast 0 0

sudo vtysh -c 'show bgp summary'

IPv4 Unicast Summary (VRF default):
BGP router identifier 172.50.0.1, local AS number 64513 vrf-id 0
BGP table version 18
RIB entries 0, using 920 bytes of memory
Peers 3, using 4338 KiB of memory

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
172.50.0.2 4 64512 1619 1624 0 0 0 00:10:54 1 3 N/A
172.50.0.3 4 64512 1619 1620 0 0 0 00:10:52 1 3 N/A
172.50.0.4 4 64512 1619 1620 0 0 0 00:10:50 1 3 N/A


Total number of neighbors 3

The State/PfxRcd column in FRR now shows a prefix count, meaning the connection is established, with the session duration in the Up/Down field. On the cilium-cli side, the Session State has changed to established.

If you check the host route table, you will see the existing routes (plus any pod CIDR routes learned over BGP), but nothing for the LoadBalancer IPs yet.

route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 xxx.xxx.xxx.1 0.0.0.0 UG 0 0 0 eth0
10.21.0.0 172.50.0.2 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.1.0 172.50.0.3 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.2.0 172.50.0.4 255.255.255.0 UG 20 0 0 br-7821810f4fc8
xxx.xxx.xxx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.50.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7821810f4fc8

Let's create a Gateway and deploy an HTTPRoute (a combined manifest is sketched after this list):

  • gatewayClassName: cilium : we use the default GatewayClass created by the Helm chart.
  • listeners : the listener configuration for the Gateway.
  • allowedRoutes : access control for the Gateway. For now I leave it open; alternatively, a Selector with a label could be used so that only namespaces carrying that label may attach HTTPRoutes to this Gateway.
  • parentRefs : maps the HTTPRoute to the proper Gateway.
  • rules : the routing rules of the HTTPRoute.
  • backendRefs : connects the HTTPRoute to the Hubble UI service.
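A sketch of hubble.yaml matching the objects shown below (namespace, names and the hubble-ui backend come from the outputs; the listener details are assumptions):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: shared-gw
  namespace: kube-system
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All            # open to all namespaces for now
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: hubble
  namespace: kube-system
spec:
  parentRefs:
    - name: shared-gw
      namespace: kube-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: hubble-ui
          port: 80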

Now deploy the config

kubectl apply -f hubble.yaml

Let's now check the Gateway, HTTPRoute and Services:

kubectl get httproute -n kube-system
NAME HOSTNAMES AGE
hubble 1m

kubectl get gateway -n kube-system
NAME CLASS ADDRESS PROGRAMMED AGE
shared-gw cilium 172.50.201.37 True 1m

kubectl get services -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.201.0.10 <none> 53/UDP,53/TCP,9153/TCP 29m
metrics-server ClusterIP 10.201.130.169 <none> 443/TCP 29m
hubble-peer ClusterIP 10.201.167.135 <none> 443/TCP 28m
hubble-relay ClusterIP 10.201.219.116 <none> 80/TCP 28m
hubble-ui ClusterIP 10.201.196.229 <none> 80/TCP 28m
cilium-gateway-shared-gw LoadBalancer 10.201.222.86 172.50.201.37 80:31031/TCP 1m20s

Nice, Cilium has created a LoadBalancer Service for our Gateway object.

Check the host route table again

route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 xxx.xxx.xxx.1 0.0.0.0 UG 0 0 0 eth0
10.21.0.0 172.50.0.2 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.1.0 172.50.0.3 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.2.0 172.50.0.4 255.255.255.0 UG 20 0 0 br-7821810f4fc8
xxx.xxx.xxx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.50.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7821810f4fc8
172.50.201.37 172.50.0.2 255.255.255.255 UGH 20 0 0 br-7821810f4fc8

The route advertised via BGP has now been added to the host machine's route table.

With FRR, you can also inspect the routes it advertises towards a given neighbor:

sudo vtysh -c 'show ip bgp neighbors 172.50.0.3 advertised-routes'
BGP table version is 19, local router ID is 172.50.0.1, vrf id 0
Default local pref 100, local AS 64513
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*> 10.21.0.0/24 0.0.0.0 0 64512 i
*> 10.21.1.0/24 0.0.0.0 0 64512 i
*> 10.21.2.0/24 0.0.0.0 0 64512 i
*> 172.50.201.37/32 0.0.0.0 0 64512 i

Total number of prefixes 4

Now let's curl the LoadBalancer IP to see if we can reach the page:

curl http://172.50.201.37/
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.104f057a7d45238d9d45.js"></script><link href="bundle.main.3818224e482785607640.css" rel="stylesheet"></head><body><div id="app"></div></body></html>

And there it is !!!

Local DNS testing of route

Please note that the domain name I use here is a placeholder; replace it with your own. The domain name at the registrar and on the HTTPRoute should match.

Since this is a local setup, we cannot just add hostnames to the HTTPRoute and expect DNS to resolve them. We can make it work by adding a hostname to the HTTPRoute and pointing that hostname to the LoadBalancer IP in /etc/hosts on the host machine. Let's try that now.

It is the same HTTPRoute, with one small difference: we add a hostnames field. Let's apply it and check the status of the HTTPRoute.
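The only change needed in hubble-with-hostnames.yaml is the hostnames field on the HTTPRoute spec; everything else stays as in the earlier sketch:

spec:
  hostnames:                   # new: restrict the route to this hostname
    - "hubble.example.com"
  parentRefs:
    - name: shared-gw
      namespace: kube-system
  # rules: unchanged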

kubectl apply -f hubble-with-hostnames.yaml
kubectl get httproute -n kube-system
NAME HOSTNAMES AGE
hubble ["hubble.example.com"] 22m

As you can see, the hostname has been added. It will not work yet, though, because the domain name is not resolvable. Let's add it to the hosts file and check:

echo "172.50.201.37 hubble.example.com" | sudo tee -a /etc/hosts>/dev/null

and now with curl

curl hubble.example.com
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.104f057a7d45238d9d45.js"></script><link href="bundle.main.3818224e482785607640.css" rel="stylesheet"></head><body><div id="app"></div></body></html>

Nice!!!

Before continuing, don’t forget to remove the entries from /etc/hosts

HAProxy Setup

Let's try to make it resolvable through real DNS. One way is to point the domain's A record at the public IP address of the server and let HAProxy route the requests to the advertised LoadBalancer IP as its backend.

As explained before, when running in the cloud we can use ExternalDNS to keep the records updated. In local setups, HAProxy is king :D

So lets test with HAProxy

sudo apt install haproxy -y

Now update the /etc/haproxy/haproxy.cfg file with a frontend and backend for our domain.

The important part is the last section, where the frontend and backend are defined; the rest of the config is the distribution default.

We match on the requested domain name and forward the traffic to a backend configured with the LoadBalancer IP.
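A sketch of that frontend/backend section, using the hostname and LoadBalancer IP from above:

frontend http-in
    bind *:80
    # match on the Host header and send the traffic to the Gateway's LoadBalancer IP
    acl host_hubble hdr(host) -i hubble.example.com
    use_backend hubble_ui if host_hubble

backend hubble_ui
    server cilium-gateway 172.50.201.37:80 check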

Now is the time to configure DNS: point the domain at the public IP of the server. In these examples I have replaced my real domain with example.com.

Now start HAProxy

sudo systemctl start haproxy

Open a browser and go to the domain “hubble.example.com”.

Again, the domain name used here is a placeholder; replace it with your own, and make sure the registrar and the HTTPRoute use the same name.

Conclusion

Cilium is an excellent CNI and has begun to embrace the Gateway API, which offers far more functionality than an Ingress controller. Here I showed how Cilium can be used in a small, simple cluster setup. It can also be used at large scale, but the configuration and tooling will differ.

I hope you enjoyed this blog

Update: see the follow-up blog on a pure IPv6 Kubernetes deployment with k3d.
