Integrating Cilium with Gateway API, IPv6, and BGP for Advanced Networking Solutions
Introduction
This document is similar to a previous one I already published, with the same configuration but with IPv4. So if you have already read that article, you will notice that a lot of text has been copied from there ;)
The setup architecture is similar to the previous document: a k3d cluster with 2 agents.
Setup
The setup is going to be:
- a k3d cluster without a CNI and with kube-proxy disabled.
- Cilium installed as the CNI, which also takes care of the routing that kube-proxy would normally handle. The BGP control plane is enabled, and a set of IPs is configured for LoadBalancers and advertised with BGP.
- Gateway API deployed instead of an Ingress controller, with a single Gateway to test.
- FRR deployed on the host machine to connect to these LoadBalancer IPs.
Like in the previous document, we configure Cilium to act as a BGP peer and advertise any new LoadBalancer IPs towards the docker bridge IP (172.50.0.1/32), while FRR is configured to listen to the k3d node IPs with the docker bridge IP (172.50.0.1/32) as the router. When the session between FRR and the Cilium BGP control plane is established, FRR publishes these new routes to the system route table, and the LoadBalancer IPs become reachable from the host system. We then use HAProxy to make the LoadBalancer IPs reachable from the internet via the public IP of the server, because the IPs themselves are not publicly routable.
Due to limitations in Cilium's pure IPv6 setup, or because of the underlying host configuration (I tested with IPv6), it is not possible to set up IPv6 routing between docker and FRR. This is why we set up the BGP communication over IPv4.
It took a lot of research on Reddit and other tech forums, documentation, and many rounds of trial-and-error configuration to end up with a configuration that works. You can use it as a baseline and experiment with different configurations to make it work for your use case.
Prerequisites
Enable IPv6 on Docker
Modify the /etc/docker/daemon.json file and add the following information:
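A minimal sketch of the relevant keys (the fixed-cidr-v6 range below is only an illustrative placeholder, not the range from the original setup):
{
  "ipv6": true,
  "fixed-cidr-v6": "2001:db8:1::/64"
}
After changing the file, restart the docker daemon so the settings take effect.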
Since IPv6 addresses are supposed to be internet-routable, there are some limitations when using IPv6 as private addressing, and docker is also not configured to work properly with IPv6 right out of the box. So for testing purposes I have configured a random range. If you are testing, you can use the same range or a different range of your choice.
Enable IPv6 kernel modules
Since Cilium uses iptables to program routes into the kernel, we need to enable some additional kernel modules:
sudo modprobe -v ip6table_filter
sudo modprobe -v ip6_tables
sudo modprobe -v ip6table_mangle
sudo modprobe -v ip6table_raw
sudo modprobe -v iptable_nat
sudo modprobe -v ip6table_nat
sudo modprobe -v iptable_filter
sudo modprobe -v xt_socket
Add this configuration to the kernel modules configuration file to persist the modules after a restart:
echo "ip6table_filter
ip6_tables
ip6table_mangle
ip6table_raw
iptable_nat
ip6table_nat
iptable_filter
xt_socket" | sudo tee /etc/modules-load.d/modules.conf
Check if the modules are loaded:
lsmod | grep xt_socket
xt_socket 16384 0
nf_socket_ipv4 16384 1 xt_socket
nf_socket_ipv6 20480 1 xt_socket
nf_defrag_ipv6 24576 2 nf_conntrack,xt_socket
nf_defrag_ipv4 16384 2 nf_conntrack,xt_socket
x_tables 57344 9 ip6table_filter,ip6table_raw,iptable_filter,ip6table_nat,xt_socket,ip6_tables,ip_tables,iptable_nat,ip6table_mangle
Ensure sysctl parameters are enabled
echo "net.core.devconf_inherit_init_net=1
net.netfilter.nf_conntrack_max=196608
net.ipv4.conf.all.forwarding = 1
net.ipv6.conf.all.forwarding = 1" | sudo tee /etc/sysctl.d/01-sysctl.conf > /dev/null
sudo sysctl --system
Tools required before starting
- helm
- k3d
- docker
- cilium-cli (version >1.14.1)
- frr (kube-router or BIRD could probably be used instead, but I haven't tried them yet)
- haproxy
Let's deploy:
All configs and scripts can be found in this github repo
Create a docker network to start with. Make sure IPv6 is enabled. We also keep an IPv4 subnet, so the k3d nodes get IPv4 addresses as well.
docker network create \
--driver bridge \
--subnet "172.50.0.0/16" \
--gateway "172.50.0.1" \
--ip-range "172.50.0.0/16" \
--ipv6 \
--subnet "2001:3200:3200::/64" \
"cilium"
Since k3d runs on docker, and docker has issues with IPv6, we run the network with an IPv4 gateway as well. This means the docker containers will have both an IPv4 and an IPv6 stack; we will use IPv6 as much as possible and fall back to IPv4 only where there is a limitation.
To create a k3d cluster with Cilium, we first need to run some commands to mount bpf and cgroups. It is better to run these commands via a k3d-entrypoint-cilium.sh entrypoint script rather than after the k3d containers are up, so we create a script and mount it as an entrypoint.
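The script itself is in the repo; a minimal sketch of what it can contain (the usual bpf/cgroup2 mount workaround used for Cilium on k3d and kind) looks like this:
#!/bin/sh
# Mount the BPF filesystem and a dedicated cgroup2 hierarchy for Cilium,
# then make both mounts shared so the Cilium pods can see them.
mount bpffs -t bpf /sys/fs/bpf
mount --make-shared /sys/fs/bpf
mkdir -p /run/cilium/cgroupv2
mount -t cgroup2 none /run/cilium/cgroupv2
mount --make-shared /run/cilium/cgroupv2
k3d should pick up and execute any script mounted as /bin/k3d-entrypoint-&lt;name&gt;.sh before starting k3s, which is why mounting it at that path is enough.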
Create a k3d config file
Note: the k3d config does not support relative paths, so replace the volume mount path with an absolute path.
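The actual config is in the repo; a rough sketch of what it can look like (the extra k3s arguments and the IPv6 CIDRs below are assumptions on my part, so adjust them to your environment):
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: cilium-cluster
servers: 1
agents: 2
network: cilium                          # the docker network created earlier
volumes:
  # absolute path required here
  - volume: /absolute/path/to/k3d-entrypoint-cilium.sh:/bin/k3d-entrypoint-cilium.sh
    nodeFilters:
      - all
options:
  k3s:
    extraArgs:
      - arg: --flannel-backend=none
        nodeFilters: ["server:*"]
      - arg: --disable-network-policy
        nodeFilters: ["server:*"]
      - arg: --disable-kube-proxy
        nodeFilters: ["server:*"]
      - arg: --disable=traefik
        nodeFilters: ["server:*"]
      - arg: --disable=servicelb
        nodeFilters: ["server:*"]
      - arg: --cluster-cidr=1001:cafe:42::/56        # illustrative IPv6 pod CIDR
        nodeFilters: ["server:*"]
      - arg: --service-cidr=1001:cafe:43:9::/112     # illustrative IPv6 service CIDR
        nodeFilters: ["server:*"]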
Deploy k3d cluster
k3d cluster create -c k3d-ipv6-config.yaml
When the cluster is deployed, a couple of pods will be stuck in Pending state. This is normal, since we still need to deploy a CNI.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cb7da708c3bb rancher/k3s:v1.30.1-k3s1 "/bin/k3d-entrypoint…" 5 minutes ago Up 4 minutes k3d-cilium-cluster-agent-1
86699ebe1fbb rancher/k3s:v1.30.1-k3s1 "/bin/k3d-entrypoint…" 5 minutes ago Up 4 minutes k3d-cilium-cluster-agent-0
b5856ce06ef3 rancher/k3s:v1.30.1-k3s1 "/bin/k3d-entrypoint…" 5 minutes ago Up 5 minutes 0.0.0.0:42663->6443/tcp k3d-cilium-cluster-server-0
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-576bfc4dc7-j6v46 0/1 Pending 0 5m17s
kube-system local-path-provisioner-75bb9ff978-xqcdj 0/1 Pending 0 5m17s
kube-system metrics-server-557ff575fb-lllqt 0/1 Pending 0 5m17s
And for Cilium's Gateway API support, a couple of CRDs need to be installed first.
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.1.0/config/crd/experimental/gateway.networking.k8s.io_gatewayclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.1.0/config/crd/experimental/gateway.networking.k8s.io_gateways.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.1.0/config/crd/experimental/gateway.networking.k8s.io_httproutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.1.0/config/crd/experimental/gateway.networking.k8s.io_tlsroutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.1.0/config/crd/experimental/gateway.networking.k8s.io_referencegrants.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.1.0/config/crd/experimental/gateway.networking.k8s.io_grpcroutes.yaml
Cilium will be installed as a Helm chart. The values for cilium:
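The full values file is in the repo; a partial sketch of the kind of values used (the API-server host, the routing CIDR, and the exact keys below are assumptions on my side, so adjust them to your chart version):
kubeProxyReplacement: "true"                  # kube-proxy is disabled in the cluster
k8sServiceHost: k3d-cilium-cluster-server-0   # assumption: reach the API server via the k3d server container
k8sServicePort: 6443
routingMode: native                           # the newer-chart equivalent of tunnel: disabled
autoDirectNodeRoutes: true
ipv6:
  enabled: true
ipv6NativeRoutingCIDR: "1001:cafe:42::/56"    # illustrative pod CIDR, match your cluster-cidr
ipam:
  mode: kubernetes
enableIPv6BIGTCP: true
bgpControlPlane:
  enabled: true
gatewayAPI:
  enabled: true
hubble:
  relay:
    enabled: true
  ui:
    enabled: true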
Notable values:
- tunnel: disabled : Tunneling does not work properly with IPv6, so it is best to disable it for now, until Cilium issues a proper fix.
- bgpControlPlane.enabled: true : We are deploying the latest rc version to test the CiliumBGPClusterConfig, which replaces the older CiliumBGPPeeringPolicy.
- enableIPv6BIGTCP: true : This is something I want to test for performance. BIG TCP improves the number of transactions per second and is an improvement over Jumbo Frames, which require every network device in the path to support them.
In the latest version, Envoy is deployed as its own DaemonSet alongside the Cilium pods. Since the scope of this story is to deploy a pure IPv6 cluster, I am not exploring whether I want to use it or get rid of it.
Installing the Helm chart now:
helm upgrade --install cilium cilium/cilium --version 1.16.0-rc.0 \
--namespace=kube-system -f cilium-values.yaml
Wait for a while: the cilium operator gets installed first, then the cilium pods are deployed to each node, and eventually all pods should be in Running state.
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system local-path-provisioner-75bb9ff978-xqcdj 1/1 Running 0 35m
kube-system coredns-576bfc4dc7-j6v46 1/1 Running 0 35m
kube-system metrics-server-557ff575fb-lllqt 1/1 Running 0 35m
kube-system cilium-operator-5bdff49868-nhl2m 1/1 Running 0 8m39s
kube-system cilium-bfjp2 1/1 Running 0 8m39s
kube-system cilium-frv2r 1/1 Running 0 8m39s
kube-system cilium-znx8l 1/1 Running 0 8m39s
kube-system cilium-envoy-wdmgs 1/1 Running 0 8m39s
kube-system cilium-envoy-jk5tq 1/1 Running 0 8m39s
kube-system cilium-envoy-xtqlt 1/1 Running 0 8m39s
kube-system hubble-ui-59bb4cb67b-bz72q 2/2 Running 0 8m39s
kube-system hubble-relay-75fb6597d7-7gzln 1/1 Running 0 8m39s
Cilium status can also be checked with the cilium-cli tool
cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: OK
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled
DaemonSet cilium Desired: 3, Ready: 3/3, Available: 3/3
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet cilium-envoy Desired: 3, Ready: 3/3, Available: 3/3
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium-envoy Running: 3
hubble-relay Running: 1
cilium-operator Running: 1
cilium Running: 3
hubble-ui Running: 1
Cluster Pods: 5/5 managed by Cilium
Helm chart version:
Image versions cilium quay.io/cilium/cilium:v1.16.0-rc.0@sha256:bc88ac635a871293d5d2837196e53adba1ea55f79cd3f5cba802dd488312fd2a: 3
hubble-ui quay.io/cilium/hubble-ui-backend:v0.13.1@sha256:0e0eed917653441fded4e7cdb096b7be6a3bddded5a2dd10812a27b1fc6ed95b: 1
hubble-ui quay.io/cilium/hubble-ui:v0.13.1@sha256:e2e9313eb7caf64b0061d9da0efbdad59c6c461f6ca1752768942bfeda0796c6: 1
cilium-envoy quay.io/cilium/cilium-envoy:v1.29.5-8fccf45a8ab9da13824e0f14122d5db35673f3bb@sha256:f2c0b275aebe14c7369c8396c4461c787b12b823aba0c613ebbed7a3f92f288e: 3
hubble-relay quay.io/cilium/hubble-relay:v1.16.0-rc.0@sha256:22b7f87db6a7a00d10e4ad8c316324368693b0e7f158055b7f81f39fb27928e2: 1
cilium-operator quay.io/cilium/operator-generic:v1.16.0-rc.0@sha256:78b9951cd6d92e7c954b9d7d2791cf52c83895441147deec3906c03363fd1169: 1
Now we need to define a set of IP ranges to hand out as LoadBalancer IPs, used when services of type LoadBalancer are created or when a Gateway is deployed.
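A minimal sketch of such a pool (the CIDR is illustrative, chosen to match the 2004::1 address you will see assigned later; the actual file is in the repo):
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool-ipv6
spec:
  blocks:
    - cidr: "2004::/64"   # illustrative IPv6 range for LoadBalancer services and Gateways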
kubectl apply -f ippool.yaml
With IPv4, Cilium automatically figures out the node IP to use as the router ID for the BGP configuration. With IPv6, this is not automated, so we need to annotate the nodes and set the router ID manually.
kubectl annotate node k3d-cilium-cluster-agent-1 --overwrite cilium.io/bgp-virtual-router.64512="router-id=172.50.0.1"
kubectl annotate node k3d-cilium-cluster-agent-0 --overwrite cilium.io/bgp-virtual-router.64512="router-id=172.50.0.1"
kubectl annotate node k3d-cilium-cluster-server-0 --overwrite cilium.io/bgp-virtual-router.64512="router-id=172.50.0.1"
Now, to use the BGP Control Plane v2, instead of a single BGP peering policy we need to define a couple of Kubernetes objects. You can find more information here on how it works.
So we need a couple of configurations. The first is the BGP cluster config: here we define the label selecting the nodes the rule applies to, and the bgpInstances. The peerAddress is the IPv6 gateway of the docker network, and this cluster config refers to a peerConfig, which is described below.
With the v2 configuration we can apply separate rules or different BGP configurations within a single cluster, which was not possible with the v1 configuration.
The peer config references a BGP advertisement config, which is selected with a labelSelector. This config will advertise routes for all services of type LoadBalancer. I have left the service selector wide open, because I want to deploy a Gateway and I need its routes to be advertised too. I could also add a label to the Gateway's service, but I am playing around :D A sketch of all three objects together is shown below.
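A rough sketch of such a bgp-config.yaml (the object names, node selector, and the match-everything service selector below are illustrative; the exact manifests are in the repo):
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux        # illustrative: matches all nodes, use your own label
  bgpInstances:
    - name: "instance-64512"
      localASN: 64512
      peers:
        - name: "frr-host"
          peerASN: 64513
          peerAddress: "2001:3200:3200::1"   # IPv6 gateway of the docker network
          peerConfigRef:
            name: "cilium-peer"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  families:
    - afi: ipv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: "bgp"
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:                      # effectively matches all services
        matchExpressions:
          - { key: somekey, operator: NotIn, values: ["never-used-value"] }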
Let's deploy the config:
kubectl apply -f bgp-config.yaml
To test, we will create a new service pointing to the hubble service, then create a Gateway and add an HTTPRoute to it.
Note that I have added a selector so the Gateway only applies to namespaces carrying the shared-gw: true label. So now we need to label the namespace and apply the config.
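A sketch of what the Gateway and HTTPRoute in hubble.yaml can look like (the hubble-ipv6 backend is a plain ClusterIP service that mirrors hubble-ui and is omitted here; the full manifest is in the repo):
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gw-ipv6
  namespace: kube-system
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              shared-gw: "true"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: hubble-ipv6-gw
  namespace: kube-system
spec:
  parentRefs:
    - name: shared-gw-ipv6
  hostnames:
    - "hubble.example.com"
  rules:
    - backendRefs:
        - name: hubble-ipv6
          port: 80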
kubectl label namespace kube-system shared-gw=true
kubectl apply -f hubble.yaml
Let's now check the Gateway, HTTPRoute and services:
kubectl -n kube-system get gateway,svc,httproute
NAME CLASS ADDRESS PROGRAMMED AGE
gateway.gateway.networking.k8s.io/shared-gw-ipv6 cilium 2004::1 True 13m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 1001:cafe:43:9::a <none> 53/UDP,53/TCP,9153/TCP 96m
service/metrics-server ClusterIP 1001:cafe:43:9::be <none> 443/TCP 96m
service/cilium-envoy ClusterIP None <none> 9964/TCP 69m
service/hubble-peer ClusterIP 1001:cafe:43:9::87 <none> 443/TCP 69m
service/hubble-relay ClusterIP 1001:cafe:43:9::cf <none> 80/TCP 69m
service/hubble-ui ClusterIP 1001:cafe:43:9::80 <none> 80/TCP 69m
service/hubble-ipv6 ClusterIP 1001:cafe:43:9::a1 <none> 80/TCP 13m
service/cilium-gateway-shared-gw-ipv6 LoadBalancer 1001:cafe:43:9::cc 2004::1 80:30604/TCP 13m
NAME HOSTNAMES AGE
httproute.gateway.networking.k8s.io/hubble-ipv6-gw ["hubble.example.com"] 13m
As you can see, the Gateway has been assigned a LoadBalancer IP from the IP pool we added earlier, and the same IP shows up on the generated service. The HTTPRoute has also got a hostname assigned to it.
Let's install FRR for advertising and receiving routes:
sudo apt install frr -y
Enable BGP for FRR by editing /etc/frr/daemons and changing the value of bgpd from no to yes.
Edit the file /etc/frr/frr.conf and add the following config (sketched below):
- We set the local ASN to 64513, because we used this number as the peer ASN in Cilium.
- We set the router ID to the gateway IP of the docker network created for the cluster.
- We define each node in the cluster as a neighbor, so FRR receives their announcements.
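A sketch of such an frr.conf (the neighbor addresses are the IPv6 addresses the k3d node containers got on the docker network, so adjust them to your environment; the exact file is in the repo):
frr defaults traditional
!
router bgp 64513
 bgp router-id 172.50.0.1
 no bgp ebgp-requires-policy
 neighbor k3d peer-group
 neighbor k3d remote-as 64512
 neighbor 2001:3200:3200::2 peer-group k3d
 neighbor 2001:3200:3200::3 peer-group k3d
 neighbor 2001:3200:3200::4 peer-group k3d
 !
 address-family ipv6 unicast
  neighbor k3d activate
 exit-address-family
!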
Now start frr
sudo systemctl start frr
After a couple of seconds, you can check the peering status on both the Cilium and FRR sides.
cilium bgp peers
Node Local AS Peer AS Peer Address Session State Uptime Family Received Advertised
k3d-cilium-cluster-agent-0 64512 64513 2001:3200:3200::1 established 15m38s ipv6/unicast 1 2
k3d-cilium-cluster-agent-1 64512 64513 2001:3200:3200::1 established 15m28s ipv6/unicast 1 2
k3d-cilium-cluster-server-0 64512 64513 2001:3200:3200::1 established 15m38s ipv6/unicast 1 2
sudo vtysh -c "show bgp summary"
IPv4 Unicast Summary:
BGP router identifier 172.50.0.1, local AS number 64513 VRF default vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 3, using 60 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
2001:3200:3200::2 4 64512 35 35 0 0 0 00:15:56 NoNeg NoNeg N/A
2001:3200:3200::3 4 64512 35 35 0 0 0 00:15:56 NoNeg NoNeg N/A
2001:3200:3200::4 4 64512 35 35 0 0 0 00:15:46 NoNeg NoNeg N/A
Total number of neighbors 3
IPv6 Unicast Summary:
BGP router identifier 172.50.0.1, local AS number 64513 VRF default vrf-id 0
BGP table version 1
RIB entries 1, using 96 bytes of memory
Peers 3, using 60 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
2001:3200:3200::2 4 64512 35 35 1 0 0 00:15:56 1 1 N/A
2001:3200:3200::3 4 64512 35 35 1 0 0 00:15:56 1 1 N/A
2001:3200:3200::4 4 64512 35 35 1 0 0 00:15:46 1 1 N/A
Total number of neighbors 3
We can see that the routes are being advertised and received. If something is wrong with the connectivity, the advertised and received route counts shown by cilium-cli will be 0, and the Up/Down status will show a "Never" state.
Let's check the routes to see if FRR is installing them properly:
ip -6 r s
::1 dev lo proto kernel metric 256 pref medium
2001:db8::/64 dev docker0 metric 1024 linkdown pref medium
2001:3200::/64 dev docker0 proto kernel metric 256 linkdown pref medium
2001:3200::/64 dev docker0 metric 1024 linkdown pref medium
2001:3200:3200::/64 dev br-455d8ca1d613 proto kernel metric 256 pref medium
2001:3200:3200::/64 dev docker0 metric 1024 linkdown pref medium
2001:3200:3200::/56 dev docker0 metric 1024 linkdown pref medium
2004::1 nhid 32 proto bgp metric 20 pref medium
nexthop via 2001:3200:3200::4 dev br-455d8ca1d613 weight 1
nexthop via 2001:3200:3200::2 dev br-455d8ca1d613 weight 1
nexthop via 2001:3200:3200::3 dev br-455d8ca1d613 weight 1
Now let's do a curl to see if we can view the page:
curl --connect-to 'hubble.example.com:80:[2004::1]:80' http://hubble.example.com
<!doctype html><html><head><meta charset="utf-8"/><title>Hubble UI</title><base href="/"/><meta name="color-scheme" content="only light"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width,user-scalable=0,initial-scale=1,minimum-scale=1,maximum-scale=1"/><link rel="icon" type="image/png" sizes="32x32" href="favicon-32x32.png"/><link rel="icon" type="image/png" sizes="16x16" href="favicon-16x16.png"/><link rel="shortcut icon" href="favicon.ico"/><script defer="defer" src="bundle.main.eae50800ddcd18c25e9e.js"></script><link href="bundle.main.1d051ccbd0f5cd57832e.css" rel="stylesheet"></head><body><div id="app" class="test"></div></body></html>
HAProxy Setup
Let's try to make it DNS-resolvable. One way to do this is to update the domain's AAAA record with the IP address of the server and use HAProxy to route the requests to the advertised backend LoadBalancer IP. If done properly, with support from your ISP, you could actually use a publicly routable IP range and run something other than docker, like testing on a proper cluster.
When running on cloud, we can make use of external DNS to update the records. In local setups, HAProxy is king :D
So let's test with HAProxy:
sudo apt install haproxy -y
Now replace the /etc/haproxy/haproxy.cfg file with the following config:
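A sketch of the relevant frontend/backend part of such a config (global and defaults sections omitted; the backend address is the Gateway LoadBalancer IP advertised over BGP):
frontend hubble_http
    bind [::]:80 v4v6
    mode http
    default_backend hubble_gateway

backend hubble_gateway
    mode http
    server gw1 [2004::1]:80 check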
Normally we would now update the DNS to point to the public IPv6 address of the server or machine we are running on. However, to keep it simple, I will add an entry to my hosts file so I can check in my browser that it works.
Now start HAProxy
sudo systemctl start haproxy
On my local machine, I will add an entry to my hosts file:
echo "xxxx:xxxx:xxxx:xxxx:0000:0000:0000:0001 hubble.example.com" | sudo tee -a /etc/hosts>/dev/null
Open a browser and go to the domain "hubble.example.com".
Please note that the domain name I used is a test one; replace it with your own domain name. The domain name from your registrar and the one on the HTTPRoute should match.
I also tested with and without BIG TCP enabled, and performance was somewhat better in my ad-hoc tests; I cannot test it properly unless I have a proper setup. To learn more about BIG TCP, you can check it here; it is an amazing article to read. In short, BIG TCP lets the CNI perform faster by sending data in chunks bigger than the default 64 KB without any change to network devices, whereas Jumbo Frames, another way to get faster performance, require physical network devices that support them.
Conclusion
Cilium keeps proving itself as an excellent CNI, efficient and packed with features, with amazing performance. It is worth exploring a pure IPv6 stack, having fun learning about the limitations it brings and how to overcome them.
I hope you enjoyed this blog.