Calico and the wrong iptables backend
The Problem
During a routine Ubuntu upgrade from 20.04 to 22.04, I encountered an interesting networking issue in our Kubernetes cluster. While the upgrade itself went smoothly, we started experiencing pod networking problems, particularly with Kubernetes API connectivity.
The Investigation
The symptoms were clear:
- Pods couldn't reach the Kubernetes API
- Internal and external service connectivity was broken
- Host networking remained functional
- Calico logs showed everything was working normally
What made this particularly puzzling was that Calico's logs showed no errors - everything appeared to be functioning as expected. However, when I checked the iptables rules, I found them empty. This was a red flag since Calico is responsible for managing these rules.
The Root Cause
After filtering Calico logs for iptables-related entries, I found this crucial line:
Looked up iptables command backendMode="iptables-legacy-save" candidates=[]string{"iptables-legacy-save", "iptables-save"}
This led to the discovery that Ubuntu 22.04 had migrated to nftables, as documented in the release notes. However, Calico was incorrectly using the legacy iptables backend.
The issue stemmed from Calico's backend detection logic in the code:
if legacyLines >= nftLines {
    detectedBackend = "legacy" // default to legacy mode
} else {
    detectedBackend = "nft"
}
Calico was choosing the legacy backend because it contained at least as many rules as the nft one (note the >=, so even a tie goes to legacy), despite the underlying system no longer using it.
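To make the tie-breaking concrete, here is a minimal sketch of that kind of detection. It is not Calico's actual code: countRules, detectBackend and the sample dumps are names and data I made up for illustration, and I'm assuming a "rule" is any line starting with "-A" in iptables-save style output.

```go
package main

import (
	"fmt"
	"strings"
)

// countRules counts rule lines ("-A ...") in iptables-save style output.
// Hypothetical helper: the real implementation may count differently.
func countRules(saveOutput string) int {
	n := 0
	for _, line := range strings.Split(saveOutput, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "-A ") {
			n++
		}
	}
	return n
}

// detectBackend repeats the comparison quoted above: ties go to legacy.
func detectBackend(legacyLines, nftLines int) string {
	if legacyLines >= nftLines {
		return "legacy"
	}
	return "nft"
}

func main() {
	// Made-up dumps: rules present in legacy, nothing in nft.
	legacyDump := "*filter\n:INPUT ACCEPT [0:0]\n-A INPUT -j cali-INPUT\n-A FORWARD -j cali-FORWARD\nCOMMIT\n"
	nftDump := "*filter\n:INPUT ACCEPT [0:0]\nCOMMIT\n"

	fmt.Println(detectBackend(countRules(legacyDump), countRules(nftDump))) // legacy
	fmt.Println(detectBackend(0, 0))                                        // legacy (tie)
	fmt.Println(detectBackend(0, 1))                                        // nft
}
```

The middle case is the interesting one: with both backends empty, the comparison still lands on legacy, whatever the system default is.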
This approach of ignoring the system's default iptables backend in favor of legacy mode is surprisingly common among networking solutions. I found the same code in other CNIs and service meshes. To me, it just seems stupid to override the system's default configuration, especially when it can lead to such subtle and hard-to-diagnose issues. But I might be overlooking something, whatever...
The Solution
While we could have set FELIX_IPTABLESBACKEND in the Calico DaemonSet, I opted for a more targeted approach to avoid potential issues for other nodes in the cluster:
- Moved the rules from the legacy backend to the nft backend
- Restarted Calico
- Verified pod networking functionality
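For the first step, a rough sketch of copying the ruleset from one backend to the other could look like this. It is not the exact procedure I ran: copyRules is a made-up helper, and I'm assuming the iptables-legacy-save and iptables-nft-restore binaries are on PATH and that the program runs as root. It only copies. The legacy rules still have to be cleared afterwards, otherwise the >= comparison above ends in a tie and picks legacy again.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// copyRules runs saveCmd, captures its stdout, and feeds that dump to
// restoreCmd on stdin. Hypothetical helper, for illustration only.
func copyRules(saveCmd, restoreCmd string) error {
	dump, err := exec.Command(saveCmd).Output()
	if err != nil {
		return fmt.Errorf("%s: %w", saveCmd, err)
	}

	restore := exec.Command(restoreCmd)
	restore.Stdin = bytes.NewReader(dump)
	if out, err := restore.CombinedOutput(); err != nil {
		return fmt.Errorf("%s: %w: %s", restoreCmd, err, out)
	}
	return nil
}

func main() {
	// Assumed binary names; check what your distribution actually ships.
	if err := copyRules("iptables-legacy-save", "iptables-nft-restore"); err != nil {
		fmt.Fprintln(os.Stderr, "copy failed:", err)
	}
}
```

IPv6 rules would need the same treatment with the ip6tables counterparts.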
Lessons Learned
- Always Read Release Notes: The Ubuntu 22.04 release notes clearly mentioned the migration to nftables. This could have been avoided with proper preparation.
- Upgrade Strategy: While in-place upgrades can work, they often lead to unexpected issues. The "cattle not pets" philosophy holds true - sometimes a clean installation is better than trying to upgrade existing systems.
- System Dependencies: When upgrading operating systems, it's crucial to consider how the changes might affect other components, especially in complex systems like Kubernetes.
- Logging Limitations: Even when logs show everything is working, it's important to verify the actual system state. In this case, the logs were misleading while the actual iptables state told the real story.
Conclusion
This incident reinforced the importance of proper system maintenance and upgrade procedures. While quick fixes can work, they often lead to technical debt and unexpected issues. In production environments, it's better to follow proper procedures and documentation rather than taking shortcuts.
Remember: Upgrades can be tricky, and sometimes the best solution is to start fresh. Cattle, not pets.