
Cilium

To do:

Networking stuff

BGP troubleshooting

On OPNsense the FRR config file lives at /usr/local/etc/frr/frr.conf, and BGP commands are run through vtysh, e.g. vtysh -c 'show bgp summary'. Other useful commands include debug bgp updates and show ip bgp neighbor.
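
For reference, the other commands run the same way through vtysh; debug bgp updates writes to the FRR log, so remember to switch it back off:

# On the OPNsense shell
vtysh -c 'show bgp summary'
vtysh -c 'show ip bgp neighbor'
vtysh -c 'debug bgp updates'    # noisy; turn off again with 'no debug bgp updates'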

Convenience commands:

# See what Cilium thinks should be peering
cilium bgp peers
# Obvs
cilium bgp routes available ipv6 unicast
cilium bgp routes advertised ipv6 unicast
# Check CR statuses for errors, general misconfig
kd CiliumBGPClusterConfig cilium-bgp
kd CiliumBGPPeerConfig primary-router
kd CiliumBGPNodeConfig $n
# Check operator and agent logs
kubectl -n kube-system logs $(kgp -l app.kubernetes.io/name=cilium-operator --no-headers | cut -d' ' -f1) | grep "bgp"
kubectl -n kube-system logs $(kgp -l app.kubernetes.io/name=cilium-agent --no-headers -owide | grep -i $n | cut -d' ' -f1) | grep "bgp"

bgp listen range 2403:580a:e4b1::/48 peer-group LAN
bgp listen range 2403:580a:e4b1::/64 peer-group LAN

AFI/SAFI overlap

Error: Configured AFI/SAFIs do not overlap with received MP capabilities

Solution (so far):

References

buncha static IP solutions

Documentation read-through thoughts

Issues

Dynamic router IDs

In single-stack IPv6, Cilium can't (or won't) derive unique router IDs for the nodes. The router ID has to be in IPv4 format, 0.0.0.0 is disallowed, and they can't overlap/clash: each node's router ID has to be unique. Annotating the nodes manually kinda sucks.

Unknown dynamic capability

FRR logs show (on startup/establishment) Unknown capability code - 75 - ignoring (or thereabouts). This looks to be Software Version. Non-critical but annoying.

Ref: https://www.iana.org/assignments/capability-codes/capability-codes.xhtml

Cilium not showing/initiating any BGP peers

When you run cilium bgp peers it's empty. Checked CiliumBGPNodeConfig and all the nodes are selected, so each has a CR instance. Checked the conditions on the CiliumBGPNodeConfigs and they're okay. cilium-agent, however, is complaining it doesn't have a router ID. This used to be settable in YAML, but the CiliumBGPPeeringPolicy that carried it is deprecated. We could set it on CiliumBGPNodeConfigOverride, but this seems like a hassle as each CR can only target one node (I think).
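
For reference, a sketch of what that per-node override would look like, going off the v2alpha1 CRD; the node name, instance name and router ID below are placeholders, and the instance name has to match the one in CiliumBGPClusterConfig:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPNodeConfigOverride
metadata:
  name: fat-controller            # must exactly match the node name; one CR per node
spec:
  bgpInstances:
    - name: "instance-65551"      # placeholder; must match the instance name in CiliumBGPClusterConfig
      routerID: "192.168.255.1"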

Solution: manually add the peers to OPNsense as remote-as internal with BFD enabled, and manually annotate the nodes with router IDs:

kubectl annotate node $n cilium.io/bgp-virtual-router.65551="router-id=192.168.255.1"

PS: the BGP peer group feature on OPNsense doesn’t quite have enough options to deduplicate the config well.

Refs:

Pods unable to reach outside cluster

Turns out return traffic is stuck. The router is sending Neighbor Solicitations for the pod IP but getting no response. Enabling promiscuous mode on the node interfaces, or assigning the pod IP to the node interface, solves this.

Promiscuous mode is not a good idea to leave on; it will tax the CPU too much and be noisy. Assigning pod IPs to the host interface could be automated, but smells.

Solution: Attempt Cilium's native BGP advertising and L2 neighbor discovery features.
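
For the BGP side, a sketch of advertising the pod CIDRs with the BGP Control Plane v2 CRDs; the label is a placeholder and has to match whatever selector the CiliumBGPPeerConfig uses for advertisements:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr
  labels:
    advertise: bgp                # placeholder; selected by CiliumBGPPeerConfig
spec:
  advertisements:
    - advertisementType: "PodCIDR"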

Cyclic dependency with certificates

Attempts to use the in-cluster default kubernetes service fail TLS checks because the IP and name don't match. The API server certificate needs an IP SAN for the well-known default service IP from the cluster service IP pool, which lives in two config files, neither of which supports drop-ins IIRC.

Solution: For now, manually add the IP when signing the certificate. In future we'll try signing for a DNS name that would resolve in-cluster.
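
Quick check that the SAN actually landed on the cert (the path is a placeholder for wherever the API server serving certificate lives):

openssl x509 -in /var/lib/kubernetes/apiserver.crt -noout -ext subjectAltName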

Host machine DNS configuration is unsuitable for in-cluster

The CoreDNS config file taken from the host needs the IPv4 address removed. Tangentially, the router's IPv6 DNS trapping needs resolution.

Solution: For now, allow the fallback/timeouts to v6. In future, possibly disable IPv4 DNS on the host machines.

Default Kubernetes service not working

The default kubernetes service isn't working; it's unclear what the issue is. Could be east-west traffic, could be north-south.

BGP configuration is static

The BGP config CR needs router information statically configured, which isn't ideal. An upstream issue is open to fix this, though.

Solution: CIDR is already too hard-coded everywhere, just wait.

Host name resolution

The machines are DHCPv4 and SLAAC, which I've worked with by enabling mDNS and Neighbor Solicitation. Go, and specifically kubectl, uses its own DNS resolution chain, which does not include mDNS. I have several possible resolutions to this.

Use machine nftables to DNAT/port forward DNS to a resolver that does do mDNS, like resolved. I'm unsure how to write this rule so as not to also trap resolved's outbound DNS traffic. I would like this eventually, though.
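
An untested sketch of that rule, assuming systemd-resolved with its stub listener on 127.0.0.53; the skuid match is the bit intended to keep resolved's own upstream queries out of the trap:

table ip dns_to_resolved {
  chain output {
    type nat hook output priority -100; policy accept;
    # let resolved's own upstream lookups through untouched
    meta skuid "systemd-resolve" return
    # hand everything else to the local stub resolver
    udp dport 53 dnat to 127.0.0.53
    tcp dport 53 dnat to 127.0.0.53
  }
}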

Have the machines register their own overrides with router DNS resolver. This means some custom code and service, since I think the OPNsense API does not implement a standard DNS update mechanism, nor does Unbound seem to have or expose an API.

DHCPv4 leases auto-registered to the router's DNS resolver. This used to be the case with the ISC DHCPv4 server, but that's deprecated and Kea DHCPv4 doesn't have it.

Solution: For now I have manually added AAAA record overrides for the machines.
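
In Unbound terms the overrides amount to something like the below; the hostname, domain and address are placeholders, and on OPNsense this is entered as a host override in the GUI rather than raw config:

server:
  local-data: "fat-controller.home.arpa. AAAA 2403:580a:e4b1::10"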

Log pulling

No proxy

Without kube-proxy, kubectl logs wants to go directly to the node and reach port 10250, which is not open. I'm assuming the absence of kube-proxy is causing this, and that otherwise it'd go via the API server or kubelets or something.

Solution: open the port to the ISP-delegated prefix, which includes the LAN and VPN CIDRs.
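
On NixOS that could look something like the below, assuming the nftables-backed firewall; the /48 is the delegated prefix from these notes:

# allow the kubelet API from the ISP-delegated prefix only
networking.firewall.extraInputRules = ''
  ip6 saddr 2403:580a:e4b1::/48 tcp dport 10250 accept
'';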

Permissions

Even when directly on the node with the pod, kubectl logs fails: Error from server (Forbidden): Forbidden (user=system:node:fat-controller, verb=get, resource=nodes, subresource=proxy) ( pods/log cilium-envoy-w2hhk). It's unclear why a node wouldn't be allowed to do this out of the box.

Turns out the API server’s client certificate was set to auth as a node, but it needs admin to be able to proxy.

Solution: Adjust DN on the certificate to system:masters:${NODE_NAME} (the node name is arbitrary but easier for auditing/traceability).

Versions

We're at Kubernetes 1.31.2; Cilium 1.16 is compatible with 1.30.4 at latest. compatibility table

Solution: Cilium released 1.17.x, just keep rolling. If need be we can pin Kubernetes.

Agent failures

IPv6 issues

failed to start: IPv6 is enabled and ip6tables modules initialization failed: could not load module ip6table_mangle: exit status 1 (try disabling IPv6 in Cilium or loading ip6_tables, ip6table_mangle, ip6table_raw and ip6table_filter kernel modules)\nfailed to stop: context deadline exceeded" subsys=daemon

Solution: Add the kernel modules ip6table_mangle, ip6table_raw and ip6table_filter.
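
On NixOS that's roughly the following; ip6_tables is included too since the error message asks for it:

boot.kernelModules = [ "ip6_tables" "ip6table_mangle" "ip6table_raw" "ip6table_filter" ];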

Missing service account CA

/var/run/secrets/kubernetes.io/serviceaccount/ca.crt is unpopulated, apparently.

Solution: Add --root-ca-file to the controller manager configuration.

Wrong APIserver networking

Looks like it's trying to hit the API server on port 443; if this is the machine, it should be 6443. But this also smells like it might be an in-cluster IP, set by the service?

Solution: Provide the master address and port to Cilium directly via Helm values, allowing it to bypass the in-cluster service.
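
The relevant Helm values are k8sServiceHost and k8sServicePort; roughly the following, with the address a placeholder for the real API server address:

# values.yaml snippet for the cilium chart
k8sServiceHost: "2403:580a:e4b1::10"   # placeholder
k8sServicePort: 6443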

$ sudo tail -f cilium-k2mkh_kube-system_config-2d981c7a3a549e59e21b73b99cb911302cf55d22ee00e038abf5815a7277ef91.log
2024-11-30T01:49:50.348745679Z stdout F Running
2024-11-30T01:49:50.350489082Z stderr F E1130 01:49:50.349606       1 config.go:529] Expected to load root CA config from /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, but got err: error creating pool from /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: data does not contain any valid RSA or ECDSA certificates
2024-11-30T01:49:50.350750773Z stderr F 2024/11/30 01:49:50 INFO Starting
2024-11-30T01:49:50.350765231Z stderr F time="2024-11-30T01:49:50Z" level=info msg="Establishing connection to apiserver" host="https://[2403:580a:e4b1:fffd::1]:443" subsys=k8s-client
2024-11-30T01:50:25.360620145Z stderr F time="2024-11-30T01:50:25Z" level=info msg="Establishing connection to apiserver" host="https://[2403:580a:e4b1:fffd::1]:443" subsys=k8s-client
2024-11-30T01:50:55.364900375Z stderr F time="2024-11-30T01:50:55Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://[2403:580a:e4b1:fffd::1]:443/api/v1/namespaces/kube-system\": dial tcp [2403:580a:e4b1:fffd::1]:443: i/o timeout" ipAddr="https://[2403:580a:e4b1:fffd::1]:443" subsys=k8s-client
2024-11-30T01:50:55.364962154Z stderr F 2024/11/30 01:50:55 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://[2403:580a:e4b1:fffd::1]:443/api/v1/namespaces/kube-system\": dial tcp [2403:580a:e4b1:fffd::1]:443: i/o timeout"
2024-11-30T01:50:55.364976241Z stderr F 2024/11/30 01:50:55 ERROR Start failed error="Get \"https://[2403:580a:e4b1:fffd::1]:443/api/v1/namespaces/kube-system\": dial tcp [2403:580a:e4b1:fffd::1]:443: i/o timeout" duration=1m5.014039638s
2024-11-30T01:50:55.365008111Z stderr F 2024/11/30 01:50:55 INFO Stopping
2024-11-30T01:50:55.365026836Z stderr F Error: Build config failed: failed to start: Get "https://[2403:580a:e4b1:fffd::1]:443/api/v1/namespaces/kube-system": dial tcp [2403:580a:e4b1:fffd::1]:443: i/o timeout

Envoy failures

$sudo tail -f cilium-envoy-nfdbn_kube-system_cilium-envoy-08165263e156b0e0422df322b140d8290ab67c9bd9d18a9ddf8223c9f498a8fb.log
2024-11-30T02:06:52.509632542Z stderr F [2024-11-30 02:06:52.509][7][warning][config] [external/envoy/source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamClusters gRPC config stream to xds-grpc-cilium closed since 1022s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: immediate connect error: No such file or directory

Orphaned pods

"Orphaned pod found, but volumes are not cleaned up" podUID="00fa5fc0-0b96-4090-b0a5-3e3b869b4e3b"

Dual stack

Enabling IPv4 in Cilium means dual-stack Kubernetes, which requires the kubelet argument --node-ip to be set. nodeIP is also not part of the kubelet config spec at v1beta1, so I can't dynamically put the IP in locally.

Both v4 and v6 are dynamic (DHCPv4 and SLAAC), though I can pin the MAC addresses in NixOS, which nominally makes the v6 address stable, if not entirely static.

I’m also not sure --node-ip could cope with being a v6 in a dual-stack world, you’d think dual stack has been stable since like 1.10 buuuuuut….
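
For the record, the dual-stack docs (first ref below) say that without a cloud provider --node-ip can take a comma-separated v4,v6 pair, so the flag would look something like this (addresses are placeholders):

--node-ip=192.168.1.10,2403:580a:e4b1::10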

It /is/ possible to do this via the API server and resource properties, BUT that's done by a cloud provider, which is orchestrated by cloud-controller-manager. I have been unable to locate much documentation about the API specification for cloud providers beyond the Go code. I'm not even sure if it's HTTP, gRPC, or whether it needs to be in Go entirely.

Ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack/
Ref: https://github.com/kubernetes/cloud-provider/blob/master/cloud.go
Ref: https://search.nixos.org/options?from=0&size=100&sort=relevance&query=macaddress

Unable to launch pods

NetworkNotReady

Containerd is complaining that there’s nothing in /etc/cni/net.d, hence the containers never start. issue about requiring loopback

Fixed eventually by deploying a dummy config. I'm not sure what causes it, but it seems containerd can get "stuck": you can't run Cilium to create your CNI config, but you don't actually want any other config either.

dummy CNI config ref:

{
  "name": "mynet",
  "type": "dummy",
  "ipam": {
    "type": "host-local",
    "subnet": "10.1.2.0/24"
  }
}

script/setup/install-cni:

{
  "cniVersion": "1.0.0",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "promiscMode": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [{
            "subnet": "10.88.0.0/16"
          }],
          [{
            "subnet": "2001:4860:4860::/64"
          }]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" },
          { "dst": "::/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}

Unable to mount service account token

MountVolume.SetUp failed for volume "kube-api-access-r6z7b" : object "kube-system"/"kube-root-ca.crt" not registered

Solution: Disabling automatic service account token mounting removes the error; manually mounting the token doesn't yield the error. ref
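
The knob in question, for reference; it can sit on the ServiceAccount or on the pod spec, sketched here on a pod with placeholder names:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  automountServiceAccountToken: false
  containers:
    - name: app
      image: example/image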

Misc