Harnessing the Power of Cilium: A Guide to BGP Integration with Gateway API on IPv4
Introduction
This document describes setting up a simple K3d cluster with Cilium as the CNI and the Cilium Gateway API in place of an Ingress controller. It also describes using the Cilium BGP Control Plane feature to announce LoadBalancer IPs. Before diving in, a short introduction to BGP.
Border Gateway Protocol
Border Gateway Protocol (BGP) is the primary protocol used to exchange routing information between different autonomous systems on the internet. An autonomous system (AS) is a collection of IP networks and routers under the control of a single organisation that presents a common routing policy to the internet. Each router, or peer, in the network announces its route information, which the peers connected on the other end receive and pass on.
Before concluding that BGP is all cool, there are some drawbacks. BGP trusts network operators to use it securely and to announce correct data. It has no built-in security, so mistakes or attacks can cause real problems.
Enough with the theory, let's get into the cool stuff.
Setup
The setup is going to be:
- a K3d cluster without a CNI and with kube-proxy disabled.
- Cilium installed as the CNI, which also takes care of the routing that kube-proxy would otherwise handle. The BGP control plane is enabled, and a set of IPs is configured for LoadBalancers and advertised over BGP.
- Gateway API deployed instead of an Ingress controller, and a single Gateway tested.
- FRR deployed on the host machine to peer with the cluster and reach the LoadBalancer IPs. Our setup looks like this:
Here we configure Cilium to act as a BGP peer and advertise any new LoadBalancer IPs towards the docker bridge IP (172.50.0.1/32), and FRR is configured to listen for the K3d node IPs, with the docker bridge IP (172.50.0.1/32) as the router. Once the session between FRR and Cilium BGP is established, FRR publishes these new routes to the system routing table, and the LoadBalancer IPs become reachable from the host system. We then use HAProxy to make the LoadBalancer IPs reachable from the internet via the public IP of the server, because these IPs are not publicly routable.
Pardon this setup; it is only meant for local testing. When running in the cloud, the IPs will come from a public network that is reachable from the internet, so once advertised they will be reachable directly, and External DNS can be used to sync the hostnames with DNS zones.
Prerequisites
Tools required before starting
- helm
- k3d
- docker
- cilium-cli (version >= 1.14.1)
- frr
- haproxy
Some things to check or enable (which I couldn't find in the docs, but in GitHub issues):
Enable IPv4 forwarding:
echo "net.ipv4.ip_forward=1" | sudo tee /etc/sysctl.d/01-sysctl.conf > /dev/null
sudo sysctl --system
Enable Socket match kernel module
The xt_socket module is part of the netfilter kernel modules; it matches IPv4 or IPv6 packets to their associated sockets.
lsmod | grep xt_socket
If the output is empty, then enable it
sudo modprobe xt_socket -v
insmod /lib/modules/6.1.46-0-lts/kernel/net/ipv6/netfilter/nf_socket_ipv6.ko.gz
insmod /lib/modules/6.1.46-0-lts/kernel/net/ipv4/netfilter/nf_socket_ipv4.ko.gz
insmod /lib/modules/6.1.46-0-lts/kernel/net/netfilter/xt_socket.ko.gz
Check again that the module is loaded:
lsmod | grep xt_socket
xt_socket 16384 0
nf_socket_ipv4 16384 1 xt_socket
nf_socket_ipv6 20480 1 xt_socket
nf_defrag_ipv6 24576 3 nf_conntrack,xt_socket,xt_TPROXY
nf_defrag_ipv4 16384 3 nf_conntrack,xt_socket,xt_TPROXY
x_tables 61440 15 ip6table_filter,xt_conntrack,iptable_filter,ip6table_nat,xt_socket,xt_comment,ip6_tables,xt_TPROXY,xt_CT,iptable_raw,ip_tables,iptable_nat,ip6table_mangle,iptable_mangle,xt_mark
Let's start…
All configs and scripts can be found in this GitHub repo.
First, create a separate docker network and call it cilium:
docker network create \
--driver bridge \
--subnet "172.50.0.0/16" \
--gateway "172.50.0.1" \
--ip-range "172.50.0.0/16" \
"cilium"
To create a k3d cluster with Cilium, we first need to run some commands inside the nodes to mount bpf and cgroups. It is better to run these commands via a k3d-entrypoint-cilium.sh script for k3d, rather than running them after the k3d containers are up. So we create a script and mount it as an entrypoint.
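The script itself is not shown in this extract; a minimal sketch, assuming the standard mounts Cilium needs inside each node, could look like this:

```sh
#!/bin/sh
# k3d-entrypoint-cilium.sh -- a sketch, not necessarily the author's exact script.
# Mounts a BPF filesystem and a cgroup2 hierarchy inside each k3d node
# so that Cilium can attach its eBPF programs.
set -e

mount bpffs /sys/fs/bpf -t bpf
mount --make-shared /sys/fs/bpf

mkdir -p /run/cilium/cgroupv2
mount -t cgroup2 none /run/cilium/cgroupv2
mount --make-shared /run/cilium/cgroupv2
```

If mounted into the nodes as /bin/k3d-entrypoint-cilium.sh, k3d's default entrypoint should pick it up and run it before k3s starts.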
Create a k3d config file (a sketch is shown after this list). Notable configurations:
- kubeAPI.hostIP: 127.172.50.1 is set this way so that we can access the k3d cluster via the docker interface.
- nodeLabels is added so that later on we can target the BGP peering policy at nodes carrying this label, so the policy can be applied to certain nodes only.
- Most of the other configs are standard, like disabling the default load balancer, kube-proxy and the flannel CNI, and adding a cluster and service CIDR.
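The config file itself is not reproduced here; a sketch of what it might contain, assuming the cluster and service CIDRs implied by the routes and service IPs shown later (10.21.0.0/16 and 10.201.0.0/16), a hypothetical bgp-policy=a node label, and the usual k3s disable flags:

```yaml
# k3d-config.yaml -- a sketch under the assumptions stated above
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: cilium-cluster
servers: 1
agents: 2
image: rancher/k3s:v1.25.7-k3s1
network: cilium                          # the docker network created earlier
kubeAPI:
  hostIP: "127.172.50.1"                 # reach the API server via this address
  hostPort: "6443"
volumes:
  - volume: ./k3d-entrypoint-cilium.sh:/bin/k3d-entrypoint-cilium.sh
    nodeFilters: ["all"]
options:
  k3s:
    extraArgs:
      - arg: --flannel-backend=none      # no flannel, Cilium will be the CNI
        nodeFilters: ["server:*"]
      - arg: --disable-network-policy
        nodeFilters: ["server:*"]
      - arg: --disable=servicelb         # no built-in load balancer
        nodeFilters: ["server:*"]
      - arg: --disable=traefik           # no bundled ingress controller
        nodeFilters: ["server:*"]
      - arg: --disable-kube-proxy        # Cilium replaces kube-proxy
        nodeFilters: ["server:*"]
      - arg: --cluster-cidr=10.21.0.0/16
        nodeFilters: ["server:*"]
      - arg: --service-cidr=10.201.0.0/16
        nodeFilters: ["server:*"]
    nodeLabels:
      - label: bgp-policy=a              # hypothetical label used later by the BGP peering policy
        nodeFilters: ["all"]
```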
Deploy k3d cluster
k3d cluster create -c k3d-config.yaml
When the cluster is deployed, there will be only a couple of pods, and they will be stuck in Pending state. This is normal, as we still need to deploy a CNI.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a17e139be15 rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 9 seconds k3d-cilium-cluster-agent-1
073d0f5d9b76 rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 9 seconds k3d-cilium-cluster-agent-0
3740871a7ecb rancher/k3s:v1.25.7-k3s1 "/bin/k3d-entrypoint…" 17 seconds ago Up 16 seconds 127.172.50.1:6443->6443/tcp k3d-cilium-cluster-server-0
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-597584b69b-7l77z 0/1 Pending 0 1s
kube-system metrics-server-5f9f776df5-sftwl 0/1 Pending 0 1s
kube-system local-path-provisioner-79f67d76f8-bg9b8 0/1 Pending 0 1s
And for Cilium's Gateway API support, we need a couple of Gateway API CRDs to be installed first.
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_gatewayclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_gateways.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_httproutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_tlsroutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.7.0/config/crd/experimental/gateway.networking.k8s.io_referencegrants.yaml
Cilium will be installed as a Helm chart. When installing Cilium as the CNI and as a replacement for kube-proxy, we need to set the k8sServiceHost and k8sServicePort values and set kubeProxyReplacement: strict, because we need Cilium to be the one doing all the network routing.
- ipam.mode: kubernetes: pool IPs from Kubernetes. The other option would be cluster-pool, which gets IPs from a configured CIDR and is useful for multi-cluster configurations. For a single cluster, kubernetes is fine.
- k8s-require-ipv4-pod-cidr: this makes Cilium use the configured cluster-cidr from the cluster. Otherwise, Cilium will pick a range from 10.0.0.0 and start using those IPs. For some reason, the Cilium IPs are not announced via BGP if the pod CIDRs are exported (I will come back to this later).
- tunnel: vxlan: this is the default mode when tunnelling is used. It is not necessary for a single cluster, but is useful for a multi-cluster setup. The tunnel also adds extra encapsulation, which is fine for a small setup; for larger clusters with many applications, the MTU of the interfaces should be adjusted to use jumbo frames (a value of 9000). (If I am successful with changing this, I will update or add a new page.)
- I have disabled the ingress controller so that I can specifically install Gateway API.
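The full cilium-values.yaml is not included in this extract; a sketch covering the values discussed above, where the API server host/port values are assumptions for this k3d setup, might look like this:

```yaml
# cilium-values.yaml -- a sketch under the assumptions stated above
kubeProxyReplacement: strict
k8sServiceHost: k3d-cilium-cluster-server-0   # API server reachable inside the docker network
k8sServicePort: 6443

ipam:
  mode: kubernetes          # allocate pod IPs from the Kubernetes node pod CIDRs
k8s:
  requireIPv4PodCIDR: true  # use the configured cluster-cidr instead of the 10.0.0.0 default

tunnel: vxlan               # default tunnelling mode

bgpControlPlane:
  enabled: true             # enable the BGP control plane

gatewayAPI:
  enabled: true             # Cilium's Gateway API support
ingressController:
  enabled: false            # Gateway API is used instead of ingress

hubble:
  relay:
    enabled: true
  ui:
    enabled: true
```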
Install Cilium Helm chart
helm upgrade --install cilium cilium/cilium --version 1.14.1 \
--namespace=kube-system -f cilium-values.yaml
Wait a while for the Cilium operator to be installed; the Cilium pods will then be deployed to each node, and all pods should reach the Running state.
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system cilium-operator-d5dd47988-58gn7 1/1 Running 0 59s
kube-system cilium-7bm9g 1/1 Running 0 59s
kube-system cilium-lm5zc 1/1 Running 0 59s
kube-system cilium-wh5pf 1/1 Running 0 59s
kube-system local-path-provisioner-79f67d76f8-bg9b8 1/1 Running 0 2m34s
kube-system coredns-597584b69b-7l77z 1/1 Running 0 2m34s
kube-system hubble-relay-6bb96bd796-d596q 1/1 Running 0 59s
kube-system metrics-server-5f9f776df5-sftwl 1/1 Running 0 2m34s
kube-system hubble-ui-869b75b895-s2m8f 2/2 Running 0 59s
Cilium status can also be checked with the cilium-cli tool
cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
Containers: cilium Running: 3
hubble-relay Running: 1
hubble-ui Running: 1
cilium-operator Running: 1
Cluster Pods: 5/5 managed by Cilium
Helm chart version: 1.14.1
Image versions hubble-ui quay.io/cilium/hubble-ui:v0.12.0@sha256:1c876cfa1d5e35bc91e1025c9314f922041592a88b03313c22c1f97a5d2ba88f: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.12.0@sha256:8a79a1aad4fc9c2aa2b3e4379af0af872a89fcec9d99e117188190671c66fc2e: 1
cilium-operator quay.io/cilium/operator-generic:v1.14.1@sha256:e061de0a930534c7e3f8feda8330976367971238ccafff42659f104effd4b5f7: 1
cilium quay.io/cilium/cilium:v1.14.1@sha256:edc1d05ea1365c4a8f6ac6982247d5c145181704894bb698619c3827b6963a72: 3
hubble-relay quay.io/cilium/hubble-relay:v1.14.1@sha256:db30e85a7abc10589ce2a97d61ee18696a03dc5ea04d44b4d836d88bd75b59d8: 1
Now we need to define a pool of IPs for LoadBalancer addresses, which Cilium assigns when services of type LoadBalancer are created or when Gateway API provisions a Gateway.
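The ippool.yaml is not reproduced here; a minimal CiliumLoadBalancerIPPool sketch for Cilium 1.14, where the CIDR is an assumption inferred from the 172.50.201.37 address that shows up later, could be:

```yaml
# ippool.yaml -- a sketch; the CIDR is an assumption
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  cidrs:
    - cidr: "172.50.201.0/24"   # LoadBalancer IPs will be handed out from this range
```

Install it: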
kubectl apply -f ippool.yaml
Previously, IP pool management and the assignment of IPs to LoadBalancer services was handled by MetalLB. Now that the BGP control plane feature is built into Cilium, we no longer need it; with older Cilium versions, it would still be best to pair Cilium with MetalLB.
Now we need to define a BGP peering policy for Cilium (a sketch is shown after this list).
- nodeSelector: applies the peering policy to the nodes carrying the matching label. I have added this label to all nodes for now.
- localASN: the local ASN number for the Cilium BGP router.
- exportPodCIDR: whether to advertise the pod CIDRs or not. We currently don't need it. It is cool to have it enabled and connect to pods directly, but that is not really secure IMHO, so I am keeping it false.
- neighbors: where we configure the neighbours for the BGP peering. We set it to the docker network gateway, as this is the entry point of communication between the docker network and the host machine.
- peerASN: the ASN number of the peer. It can be the same as the localASN, but for testing, and to see that the session gets established, we use a different private ASN.
- serviceSelector: controls which service IPs should be advertised. To advertise all of them, the operator should be NotIn with a dummy value.
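The bgpp.yaml itself is not included in this extract; a sketch of a CiliumBGPPeeringPolicy matching the description above, where the node label and the serviceSelector key are hypothetical, could look like this:

```yaml
# bgpp.yaml -- a sketch; the label and the serviceSelector key are assumptions
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: bgp-peering-policy
spec:
  nodeSelector:
    matchLabels:
      bgp-policy: a                    # hypothetical label added via the k3d config
  virtualRouters:
    - localASN: 64512
      exportPodCIDR: false             # kept false, as discussed above
      serviceSelector:
        matchExpressions:
          - key: somekey               # dummy expression so that all services match
            operator: NotIn
            values: ["never-used-value"]
      neighbors:
        - peerAddress: "172.50.0.1/32" # the docker network gateway
          peerASN: 64513
```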
Apply the Peering policy
kubectl apply -f bgpp.yaml
Check the status of the BGP sessions with cilium-cli:
cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
k3d-cilium-cluster-agent-1 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
k3d-cilium-cluster-server-0 64512 64513 172.50.0.1 active 0s ipv4/unicast 0 0
ipv6/unicast 0 0
As you can see, the Session State is active, which means the BGP router is running in Cilium and is trying to reach its peer. Now, to configure the host machine as that peer, we will install and configure FRR:
sudo apt install frr -y
Enable BGP for FRR by editing /etc/frr/daemons and changing the value of bgpd from no to yes.
Then edit the file /etc/frr/frr.conf and add the following config (a sketch is shown after this list):
- We set the local ASN number to 64513, because we used this number as the peerASN in Cilium.
- We set the router-id to the gateway IP of the docker network created for the cluster.
- We define each node in the cluster as a neighbour in the configuration, so that we receive its announcements, and set the remote ASN.
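A sketch of such an frr.conf, where the node addresses 172.50.0.2-4 are taken from the neighbour table shown below and the policy line is an extra assumption (recent FRR versions otherwise refuse to exchange eBGP routes without an explicit policy):

```
! /etc/frr/frr.conf -- a sketch under the assumptions stated above
router bgp 64513
 bgp router-id 172.50.0.1
 no bgp ebgp-requires-policy
 neighbor 172.50.0.2 remote-as 64512
 neighbor 172.50.0.3 remote-as 64512
 neighbor 172.50.0.4 remote-as 64512
!
```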
Now start frr
sudo systemctl start frr
After a couple of seconds, you can check the peering status from either Cilium or FRR:
cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 172.50.0.1 established 10m49s ipv4/unicast 3 1
ipv6/unicast 0 0
k3d-cilium-cluster-agent-1 64512 64513 172.50.0.1 established 10m47s ipv4/unicast 3 1
ipv6/unicast 0 0
k3d-cilium-cluster-server-0 64512 64513 172.50.0.1 established 10m51s ipv4/unicast 3 1
ipv6/unicast 0 0
sudo vtysh -c 'show bgp summary'
IPv4 Unicast Summary (VRF default):
BGP router identifier 172.50.0.1, local AS number 64513 vrf-id 0
BGP table version 18
RIB entries 0, using 920 bytes of memory
Peers 3, using 4338 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
172.50.0.2 4 64512 1619 1624 0 0 0 00:10:54 1 3 N/A
172.50.0.3 4 64512 1619 1620 0 0 0 00:10:52 1 3 N/A
172.50.0.4 4 64512 1619 1620 0 0 0 00:10:50 1 3 N/A
Total number of neighbors 3
The State/PfxRcd column in FRR shows the number of prefixes received once the session is established, and the Up/Down field shows how long it has been up. On the cilium-cli side, the Session State has now changed to established.
If you check the route table, you will see the default routes; nothing interesting there yet.
route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 xxx.xxx.xxx.1 0.0.0.0 UG 0 0 0 eth0
10.21.0.0 172.50.0.2 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.1.0 172.50.0.3 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.2.0 172.50.0.4 255.255.255.0 UG 20 0 0 br-7821810f4fc8
xxx.xxx.xxx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.50.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7821810f4fc8
Let's create a Gateway and deploy an HTTPRoute (a sketch of both is shown after this list).
- gatewayClassName: cilium: we use the default gateway class created by the Helm chart.
- listeners: the listener configuration.
- allowedRoutes: access control. For now I am leaving it open; we could instead use a Selector, so that only namespaces carrying a given label are allowed to attach HTTPRoutes to this gateway.
- parentRefs: maps the HTTPRoute to the proper Gateway.
- rules: the rules for the HTTPRoute configuration.
- backendRefs: connects the HTTPRoute to the Hubble UI service.
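The hubble.yaml manifest is not reproduced in this extract; a sketch consistent with the fields above and the names visible in the outputs below (shared-gw, hubble, hubble-ui) might be:

```yaml
# hubble.yaml -- a sketch; listener and rule details are assumptions
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: shared-gw
  namespace: kube-system
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All              # left open for now
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: hubble
  namespace: kube-system
spec:
  parentRefs:
    - name: shared-gw
      namespace: kube-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: hubble-ui
          port: 80
```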
Now deploy the config
kubectl apply -f hubble.yaml
Let's now check the Gateway, HTTPRoute, and services:
kubectl get httproute -n kube-system
NAME HOSTNAMES AGE
hubble 1m
kubectl get gateway -n kube-system
NAME CLASS ADDRESS PROGRAMMED AGE
shared-gw cilium 172.50.201.37 True 1m
kubectl get services -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.201.0.10 <none> 53/UDP,53/TCP,9153/TCP 29m
metrics-server ClusterIP 10.201.130.169 <none> 443/TCP 29m
hubble-peer ClusterIP 10.201.167.135 <none> 443/TCP 28m
hubble-relay ClusterIP 10.201.219.116 <none> 80/TCP 28m
hubble-ui ClusterIP 10.201.196.229 <none> 80/TCP 28m
cilium-gateway-shared-gw LoadBalancer 10.201.222.86 172.50.201.37 80:31031/TCP 1m20s
Nice, we have a LoadBalancer service for our Gateway object from Cilium.
Check the host route table again:
route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 xxx.xxx.xxx.1 0.0.0.0 UG 0 0 0 eth0
10.21.0.0 172.50.0.2 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.1.0 172.50.0.3 255.255.255.0 UG 20 0 0 br-7821810f4fc8
10.21.2.0 172.50.0.4 255.255.255.0 UG 20 0 0 br-7821810f4fc8
xxx.xxx.xxx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.50.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7821810f4fc8
172.50.201.37 172.50.0.2 255.255.255.255 UGH 20 0 0 br-7821810f4fc8
Now the route advertised via BGP has been added to the host machine's route table.
With FRR, you can also check the routes advertised to a given neighbour:
sudo vtysh -c 'show ip bgp neighbors 172.50.0.3 advertised-routes'
BGP table version is 19, local router ID is 172.50.0.1, vrf id 0
Default local pref 100, local AS 64513
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*> 10.21.0.0/24 0.0.0.0 0 64512 i
*> 10.21.1.0/24 0.0.0.0 0 64512 i
*> 10.21.2.0/24 0.0.0.0 0 64512 i
*> 172.50.201.37/32 0.0.0.0 0 64512 i
Total number of prefixes 4
Now let's do a curl to see if we can view the page:
curl http://172.50.201.37/
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.104f057a7d45238d9d45.js"></script><link href="bundle.main.3818224e482785607640.css" rel="stylesheet"></head><body><div id="app"></div></body></html>
And there it is !!!
Local DNS testing of route
Please note that the domain name I used is a placeholder; replace it with your own domain name. The domain name at the registrar and in the HTTPRoute should match.
Since this is a local setup, we cannot just add hostnames to the HTTPRoute and expect DNS to work. We can make it work by adding a hostname to the HTTPRoute and pointing that hostname at the LoadBalancer IP in /etc/hosts on the host machine. Let's try that now.
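A sketch of the modified HTTPRoute, identical to the one above apart from the hostnames field (hubble.example.com is a placeholder):

```yaml
# hubble-with-hostnames.yaml -- a sketch; only the hostnames field is new
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: hubble
  namespace: kube-system
spec:
  hostnames:
    - hubble.example.com        # replace with your own domain
  parentRefs:
    - name: shared-gw
      namespace: kube-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: hubble-ui
          port: 80
```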
As you can see, it is the same HTTPRoute with one slight difference: we add hostnames. Let's apply it and check the status of the HTTPRoute:
kubectl apply -f hubble-with-hostnames.yaml
kubectl get httproute -n kube-system
NAME HOSTNAMES AGE
hubble ["hubble.example.com"] 22m
As you can see, the hostname has been added. But it will not work yet, because the domain name is not resolvable. Let's add it to the hosts file and check:
echo "172.50.201.37 hubble.example.com" | sudo tee -a /etc/hosts>/dev/null
And now with curl:
curl hubble.example.com
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.104f057a7d45238d9d45.js"></script><link href="bundle.main.3818224e482785607640.css" rel="stylesheet"></head><body><div id="app"></div></body></html>
Nice!!!
Before continuing, don’t forget to remove the entries from /etc/hosts
HAProxy Setup
Let's try to make it DNS-resolvable. One way to do that is to update the domain's A record with the IP address of the server and use HAProxy to route the requests to the advertised backend LoadBalancer IP.
As explained before, when running in the cloud we can make use of External DNS to update the records. In local setups, HAProxy is king :D
So let's test with HAProxy:
sudo apt install haproxy -y
Now replace the /etc/haproxy/haproxy.cfg file with the following config.
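The config is not reproduced in this extract; a sketch of the relevant section, appended to the default haproxy.cfg, where the frontend, backend, and ACL names are assumptions:

```
# appended to the default /etc/haproxy/haproxy.cfg -- a sketch
frontend http-in
    bind *:80
    # match on the Host header and send traffic to the right backend
    acl host_hubble hdr(host) -i hubble.example.com
    use_backend hubble_ui if host_hubble

backend hubble_ui
    # the advertised LoadBalancer IP of the cilium-gateway-shared-gw service
    server gateway 172.50.201.37:80 check
```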
The important part is the last section, where the frontend and backend are defined; the rest of the config is the default.
We check the domain name and forward the request to the backend configured with the LoadBalancer IP.
Now is the time to configure DNS: point the domain at the public IP of the server. I replaced my real domain name with the example.com domain.
Now start HAProxy
sudo systemctl start haproxy
Open a browser and go to the domain “hubble.example.com”.
Please note that the domain name I used is a placeholder; replace it with your own domain name. The domain name at the registrar and in the HTTPRoute should match.
Conclusion
Cilium is an excellent CNI and has started to bring the Gateway API to life, which offers far more functionality than an Ingress controller; the capabilities and features really cannot be compared. Here I showed how Cilium can be used in a small, simple cluster setup. It can also be used at large scale, but the configuration and tooling will differ.
I hope you enjoyed this blog.
Update: see also the blog on a pure IPv6-stack based Kubernetes deployment with k3d.