Troubleshooting networking issues in a Kubernetes cluster can be a challenging task, especially when it comes to capturing TCP packets. From network connectivity problems to latency issues, there are several factors that can impact the performance and stability of your cluster.
In this article, we will provide you with a step-by-step guide on how to capture TCP packets using powerful tools like tcpdump and netstat. These tools allow you to examine the flow of TCP packets within your Azure Kubernetes Service (AKS) cluster, helping you identify and resolve networking issues efficiently.
Traditionally, tools like tcpdump and netstat are interactive, requiring constant monitoring and interaction with the command line. However, we will explore ways to run these tools in the background as part of a script, enabling you to gather data from all the nodes in your AKS cluster effortlessly.
Whether you’re a seasoned Kubernetes expert or just starting your journey, this article will equip you with the knowledge and techniques needed to troubleshoot networking problems in your Azure Kubernetes Service cluster. With that in mind, let’s see how tcpdump and netstat can help unravel networking intricacies in your Kubernetes cluster.
TCP Captures
In this section, we’ll focus on gathering the TCP captures. These can be opened in tools like Wireshark to diagnose TCP flow issues, among other things.
Create debug namespace
To keep things clean and organized, we create a debug namespace that will hold all the debug pods used for the TCP captures.
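A minimal sketch of this step is shown here. The namespace name `debug` and the function name `create_debug_namespace` are assumptions made for illustration, not necessarily what the full script uses.

```shell
# Namespace name is an assumption; override DEBUG_NS to fit your cluster.
DEBUG_NS="${DEBUG_NS:-debug}"

create_debug_namespace() {
  # Only create the namespace when it is missing, so reruns do not fail.
  if ! kubectl get namespace "$DEBUG_NS" >/dev/null 2>&1; then
    kubectl create namespace "$DEBUG_NS"
  fi
}
```

Checking for the namespace first keeps the script safe to run more than once.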
Get the cluster’s external IP
Next, we need to get our cluster’s external IP, since we will use it later in our scripts:
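One possible way to read the external IP is from the ingress controller's LoadBalancer service. The service name `ingress-nginx-controller` and the namespace `ingress-nginx` below are assumptions based on a common ingress-nginx installation; adjust them to match your cluster.

```shell
# Both names are assumptions (defaults of a typical ingress-nginx install).
INGRESS_NS="${INGRESS_NS:-ingress-nginx}"
INGRESS_SVC="${INGRESS_SVC:-ingress-nginx-controller}"

get_external_ip() {
  # Print the first public IP assigned to the LoadBalancer service.
  kubectl get service "$INGRESS_SVC" --namespace "$INGRESS_NS" \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

A script would typically store the result in a variable, for example `externalIp=$(get_external_ip)`.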
Find nodes with Nginx
Now that we have the external IP, we need to find the Kubernetes nodes that have Nginx pods deployed on them. Why Nginx? Because that is what we used as an ingress controller for our cluster, so external traffic hits the Nginx pods, and therefore the nodes they run on.
Let’s find those nodes:
One thing to note is that while a node can have multiple Nginx pods, we only need one ingress pod per node, so we break out of the loop after we find the first one.
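The node lookup could be sketched as follows. Instead of an explicit loop with a break, this version lets awk keep only the first pod it sees for each node, which has the same effect. The namespace and the label selector are assumptions taken from a typical ingress-nginx installation, and both function names are hypothetical.

```shell
# Namespace and label selector are assumptions; adjust to your ingress setup.
INGRESS_NS="${INGRESS_NS:-ingress-nginx}"
INGRESS_SELECTOR="${INGRESS_SELECTOR:-app.kubernetes.io/name=ingress-nginx}"

list_ingress_pods() {
  # Print one "<node> <pod>" line per ingress pod.
  kubectl get pods --namespace "$INGRESS_NS" --selector "$INGRESS_SELECTOR" \
    --output jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.metadata.name}{"\n"}{end}'
}

first_pod_per_node() {
  # Keep only the first line seen for each node name (column 1).
  awk 'NF == 2 && !seen[$1]++'
}
```

Piping the two together, as in `list_ingress_pods | first_pod_per_node`, yields one line per node that hosts an ingress pod.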
Once we have all the nodes, we can start the TCP capture process. We’ll present some fragments of the tcp_capture function, and at the end of the section we’ll put all the code. Remember that our tcp_capture function is launched for all nodes in parallel.
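Launching one capture per node in parallel can be done with shell background jobs. The helper below is a sketch under assumptions: the name `run_on_all_nodes` is hypothetical, and it expects the worker function (for example tcp_capture) to accept a node name and a pod name as its two arguments.

```shell
run_on_all_nodes() {
  # $1: name of the function to call for each node.
  # stdin: one "<node> <pod>" pair per line.
  worker="$1"
  while read -r node pod; do
    "$worker" "$node" "$pod" &   # start the worker in the background
  done
  wait                           # block until every background worker is done
}
```

Because each worker runs as a background job, the total run time is roughly that of a single capture rather than the sum over all nodes.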
Capture the packets
Let’s start by creating a debug pod, inside our debug namespace, that will be used to install and run the tcpdump utility. For this we use the kubectl command-line tool.
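A sketch of this step, under several assumptions, could look like the block below. Rather than installing tcpdump at run time, it assumes an image that already ships the tool (`nicolaka/netshoot` is used as an example). The pod name pattern, the capture path, the filter on the external IP, and the function name `start_capture_pod` are all illustrative choices.

```shell
# All defaults below are assumptions made for this sketch.
DEBUG_NS="${DEBUG_NS:-debug}"
DEBUG_IMAGE="${DEBUG_IMAGE:-nicolaka/netshoot}"   # assumed to include tcpdump
CAPTURE_FILE="${CAPTURE_FILE:-/tmp/capture.pcap}"

start_capture_pod() {
  # $1: node to capture on, $2: external IP used as the tcpdump filter.
  node="$1"
  external_ip="$2"
  pod="tcpdump-$node"
  # Pin the pod to the node and share the node's network namespace.
  overrides=$(printf '{"apiVersion":"v1","spec":{"nodeName":"%s","hostNetwork":true}}' "$node")
  # tcpdump runs first; `tail -f /dev/null` keeps the container alive after
  # tcpdump is stopped, so the capture file can still be copied out.
  kubectl run "$pod" --namespace "$DEBUG_NS" --image "$DEBUG_IMAGE" \
    --restart=Never --overrides "$overrides" --command -- \
    sh -c "tcpdump -i any -w $CAPTURE_FILE host $external_ip; tail -f /dev/null"
  kubectl wait --for=condition=Ready "pod/$pod" --namespace "$DEBUG_NS" --timeout=120s
}
```

Depending on the cluster's security policies, the pod may need extra privileges to capture on the host network, so treat this as a starting point.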
Once we have the debug pod with the tcpdump utility installed, we need to wait for the capture to finish:
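The wait itself can be a plain sleep. In this sketch, `sleepTime` holds the capture duration in seconds; the default of 60 and the function name `wait_for_capture` are assumptions.

```shell
# Capture duration in seconds; the default value is an assumption.
sleepTime="${sleepTime:-60}"

wait_for_capture() {
  # Reject empty or non-numeric values before calling sleep.
  case "$sleepTime" in
    ''|*[!0-9]*) echo "sleepTime must be a whole number of seconds" >&2; return 1 ;;
  esac
  echo "Capturing for ${sleepTime}s"
  sleep "$sleepTime"
}
```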
When the sleepTime duration expires, the script will resume execution. At this point we need to stop the tcpdump process that is running on our debug pod, so that we can download the capture file to our local machine:
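Stopping the capture might look like the sketch below. It assumes that `pkill` is available in the debug image and that the pod stays alive after tcpdump exits. The default signal sent by `pkill` is SIGTERM, which tcpdump handles by flushing the capture file before it exits. The function name is hypothetical.

```shell
DEBUG_NS="${DEBUG_NS:-debug}"   # namespace name is an assumption

stop_capture() {
  # $1: name of the debug pod that is running tcpdump.
  kubectl exec "$1" --namespace "$DEBUG_NS" -- pkill tcpdump
}
```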
With the tcpdump process stopped, it’s time to copy the TCP capture file to our local machine, providing a name of our choosing. This will make it easier to organize the capture files from all the Kubernetes nodes:
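A copy step along these lines could be used. The capture path and the local naming scheme `capture-<node>.pcap` are assumptions, and `kubectl cp` needs the `tar` binary to be present in the debug image.

```shell
DEBUG_NS="${DEBUG_NS:-debug}"                      # assumption
CAPTURE_FILE="${CAPTURE_FILE:-/tmp/capture.pcap}"  # assumption

download_capture() {
  # $1: debug pod name, $2: node name used to label the local file.
  kubectl cp "$DEBUG_NS/$1:$CAPTURE_FILE" "./capture-$2.pcap"
}
```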
As seen in the command, the TCP capture will be copied to the local system, into the folder you are running the script from.
Cleanup
Once we have the TCP captures on our machine, we have to delete our debug pod. Otherwise, this pod will keep using resources from our cluster:
The last step in this TCP capture guide is to delete the debug namespace from our cluster. This will leave the cluster as we initially found it, before starting this guide.
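Both cleanup steps can be sketched as two small helpers. The namespace name and the function names are assumptions.

```shell
DEBUG_NS="${DEBUG_NS:-debug}"   # namespace name is an assumption

delete_debug_pod() {
  # $1: debug pod name. --ignore-not-found keeps reruns from failing.
  kubectl delete pod "$1" --namespace "$DEBUG_NS" --ignore-not-found
}

delete_debug_namespace() {
  # Run once, after every per-node capture has been copied and its pod deleted.
  kubectl delete namespace "$DEBUG_NS" --ignore-not-found
}
```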
Netstat
Sometimes TCP captures are not enough to troubleshoot an issue. We can use the netstat utility if we need to dig a bit deeper and see whether a connection is in a certain state, like ESTABLISHED, TIME_WAIT, FIN_WAIT1 and others. However, netstat alone does not give us enough information to follow a connection's state over time. Let’s see what the initial netstat output looks like and how we can enhance it to get an even better overview of the status of our connections.
Netstat without timestamps
The output of netstat does indeed give us valuable information on the status of a certain connection if we need it:
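On Linux, a common way to list TCP connections with numeric addresses is `netstat -tan`, where the connection state is the sixth column. As a small illustration, the hypothetical helper below summarizes that output by counting connections per state.

```shell
count_tcp_states() {
  # Reads `netstat -tan` output on stdin and prints "<STATE> <count>" lines.
  # Only lines whose first column starts with "tcp" are counted.
  awk '$1 ~ /^tcp/ { count[$6]++ } END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as `netstat -tan | count_tcp_states`.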
However, we have to execute the netstat command manually every time we want to see the details of our connections. Let’s enhance the output and automate this process so that it helps us even further in our networking troubleshooting.
Netstat with timestamps
Since we want to automate everything we can, let’s create a script that will execute the netstat command and output the result to a file. In this script, before each execution of netstat we will also show a timestamp. This will give us the ability to view when a netstat command was executed.
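A minimal sketch of such a script follows. The function name `netstat_loop`, its three parameters, the `-tan` flags, and the timestamp format are assumptions chosen for illustration.

```shell
netstat_loop() {
  # $1: number of runs, $2: seconds between runs, $3: output file.
  iterations="$1"
  interval="$2"
  outfile="$3"
  i=0
  while [ "$i" -lt "$iterations" ]; do
    # Write the timestamp on its own line, then the netstat output below it.
    date '+%Y-%m-%d %H:%M:%S' >> "$outfile"
    netstat -tan >> "$outfile"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

For example, `netstat_loop 60 5 netstat.log &` would sample the connections every 5 seconds for about 5 minutes in the background.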
Great! We have a script that can run the netstat command multiple times and save the output to a file. Do note that before running the netstat command each time, we include a timestamp in the output file.
So, if we examine the output file, we will see a timestamp followed by the netstat command output.
As we scroll down, we’ll encounter another timestamp and its corresponding netstat output, and so on. This gives us a good overview. However, if we want to search for a specific port number using grep, we can find it, but it might be challenging to determine when that output was generated and the connection’s status at that time.
To solve this problem, we need an additional script. The purpose of this new script is to take the output file generated by the previous script and append corresponding timestamps to each line in the file:
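One way to write this second script is with awk. The sketch assumes that each timestamp sits alone on its own line in `YYYY-MM-DD HH:MM:SS` format, and it places the timestamp at the start of every following line. The function name `add_timestamps` is hypothetical.

```shell
add_timestamps() {
  # $1: file with timestamp lines followed by netstat output.
  # Prints every non-timestamp line prefixed with the latest timestamp seen.
  awk '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { ts = $0; next }
    ts != "" { print ts, $0 }
  ' "$1"
}
```

Redirecting the result, as in `add_timestamps netstat.log > netstat-fixed.log`, keeps the original file untouched.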
If we execute the above script and inspect the output file, we can see that every netstat row now carries the timestamp of the run that produced it.
This output format is much better: it lets us search the fixed file and follow the state of a connection over time. It can help a lot in determining whether you have connections that remain for a long time in a given state, for example FIN_WAIT1 (an orphaned connection), so that you can take appropriate action to fix the issue.
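As an illustration of such a search, the hypothetical helper below prints the state history of every connection that uses a given port. It assumes the fixed file has a two-field timestamp (date and time) in front of the usual netstat columns.

```shell
connection_timeline() {
  # $1: timestamped netstat file, $2: port number to look for.
  # Columns: 1 date, 2 time, 3 proto, 6 local address, 7 remote address, 8 state.
  awk -v port="$2" '
    $3 ~ /^tcp/ && ($6 ~ (":" port "$") || $7 ~ (":" port "$")) {
      print $1, $2, $6, $7, $8
    }
  ' "$1"
}
```

Matching on the port with an end-of-field anchor avoids false hits, such as port 5100 matching 51000.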
Conclusion
Troubleshooting networking issues in a Kubernetes cluster can be a challenging task, even for experienced users. However, with the aid of tools like tcpdump and netstat, along with some gluing scripts, the process of resolving these issues can be made easier and more efficient.
Below you can find all three scripts: the one that captures TCP packets, the one that runs netstat and writes the result to a file, and the fixing script that adds timestamps to every row of the netstat output file.