Calico and wrong iptables

2023-06-19

I needed to reinstall a couple of servers to get a newer Ubuntu version on them (20.04 -> 22.04) because of a kubelet bug we had on a couple of nodes.

Because of my lazy ass, I skipped Ansible and the correct process and just ran a dist-upgrade on the servers. It worked. I registered the worker nodes back to the cluster, uncordoned them and called it a day.

A bit later, I noticed network issues on the upgraded nodes: the Kubernetes API was unreachable.

But why the hell would a pod not reach the Kubernetes API? I exec'd into a random healthy pod and noticed that pod networking did not work at all. Pods could not reach any services, external or internal.

Clearly, the network was the culprit. Yet the host network worked and Calico reported itself as perfectly healthy. Nothing in the system pointed at a pod network failure.

After some time poking around the system, I noticed empty iptables on the server. This is weird because Calico is responsible for managing the iptables rules and, according to the logs, everything seemed functional. No errors, just INFO logs saying that it’s doing its job as it should. But still, despite the log messages, the tables were empty.

When I filtered the Calico INFO logs for the string iptables, this line caught my eye:

Looked up iptables command backendMode="iptables-legacy-save" candidates=[]string{"iptables-legacy-save", "iptables-save"}

Interesting. I listed the rules from iptables and compared them with the rules from iptables-legacy. WTF, Calico really was using the legacy iptables backend. Clearly, the dist-upgrade had migrated Ubuntu to nftables, which is also noted in the release notes. First mistake: I did not read them. But still, Calico picked the wrong backend candidate for some reason.
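The comparison I did by hand can be sketched roughly like this. It is only an illustration, not Calico's code: countRules is a made-up helper name, and it assumes the text it gets is in the usual iptables-save format, where each rule is a line starting with "-A".

```go
package main

import (
	"bufio"
	"strings"
)

// countRules is a hypothetical helper: it counts the rule lines
// ("-A <chain> ...") in iptables-save style output and ignores
// table headers, chain declarations, comments and COMMIT lines.
func countRules(saveOutput string) int {
	count := 0
	scanner := bufio.NewScanner(strings.NewReader(saveOutput))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "-A ") {
			count++
		}
	}
	return count
}
```

Feeding it the output of iptables-legacy-save and then of iptables-nft-save (for example captured with os/exec) gives two numbers to compare. On my broken nodes the legacy side had all the rules and the nft side had none.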

I could set FELIX_IPTABLESBACKEND in the Calico DaemonSet, but I was not confident all nodes would be cool with this change. Instead I tried to understand from the code how Calico chooses between the backends, and I found this:

			if legacyLines >= nftLines {
				detectedBackend = "legacy" // default to legacy mode
			} else {
				detectedBackend = "nft"
			}

Calico picked the legacy backend because it contained more rules than the nft one, even though the underlying system was not using it anymore. Not cool.
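To make the consequence of that comparison concrete, here is a small sketch that wraps the same condition in a function. The name pickBackend and its signature are my own invention for illustration. Only the comparison itself comes from the snippet above.

```go
package main

// pickBackend is a hypothetical wrapper around the comparison quoted
// above: whichever backend holds more rule lines wins, and a tie
// (including two empty rule sets) falls back to legacy.
func pickBackend(legacyLines, nftLines int) string {
	if legacyLines >= nftLines {
		return "legacy" // default to legacy mode
	}
	return "nft"
}
```

With these rules, pickBackend(0, 0) and pickBackend(120, 0) both return "legacy", and only something like pickBackend(3, 50) returns "nft". So once the legacy tables hold more rules than the nft ones, nothing in this check cares which backend the host itself actually uses.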

I moved the rules from one backend to the other, restarted Calico and networking started to work. If you ask me, I don’t like this behaviour at all.

Anyway, my lesson here was the importance of properly reading the release notes and always thinking about how a change can interfere with other systems.

Most importantly, once again, I learned that upgrades suck. Just kill it and install it from scratch. Cattle won against pets.